An autoregressive process with Markov regime is an autoregressive
process for which the regression function at each time point is
given by a nonobservable Markov chain. In this paper we consider
the asymptotic properties of the maximum likelihood estimator in
a possibly nonstationary process of this kind for which the hidden
state space is compact but not necessarily finite. Consistency and
asymptotic normality are shown to follow from uniform exponential
forgetting of the initial distribution for the hidden Markov chain
conditional on the observations.

1. Introduction. An autoregressive process with Markov regime, or
Markov-switching autoregression, is a bivariate process $\{(X_k, Y_k)\}$,
where $\{X_k\}$ is a Markov chain on a state space $\mathcal{X}$ and,
conditional on $\{X_k\}$, $\{Y_k\}$ is an inhomogeneous $s$-order Markov
chain on a state space $\mathcal{Y}$ such that the conditional distribution
of $Y_n$ only depends on $X_n$ and lagged $Y$'s. The process $\{X_k\}$,
usually referred to as the regime, is not observable and inference has
to be carried out in terms of the observable process $\{Y_k\}$. In general
we can write a model of this kind as

$$Y_n = f_\theta(\overline{\mathbf{Y}}_{n-1}, X_n, e_n),$$

where $\{e_k\}$ is an independent and identically distributed sequence of
random variables that we denote the innovation process (the $e$'s are not the
innovation process in Wold's sense, however),
$\overline{\mathbf{Y}}_k = (Y_k, \ldots, Y_{k-s+1})$
and $\{f_\theta\}$ is a family of functions indexed by a finite-dimensional
parameter $\theta$. Of particular interest are the linear autoregressive
models for which

$$f_\theta(\overline{\mathbf{Y}}_{n-1}, X_n, e_n) = \sum_{i=1}^{s} a_i(X_n; \theta)\, Y_{n-i} + e_n.$$

These models were initially proposed by Hamilton (1989) in econometric 
theory; the number of states of the Markov chain is in this context most 
often assumed to be finite, each state being associated with a given state 
of the economy [see Krolzig (1997), Kim and Nelson (1999) and references 
therein]. Linear autoregressive processes with Markov regime are also widely 
used in several electrical engineering areas including tracking of maneuvering 
targets [Bar-Shalom and Li (1993)], failure detection [Tugnait (1982)] and 
stochastic adaptive control [Doucet, Logothetis and Krishnamurthy (2000)]; 
in such cases the hidden state is most often assumed to be continuous. 
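
To make the linear model concrete, the following is a minimal simulation sketch for a finite-state regime. It is an illustration only and is not taken from the paper: the function name `simulate_ms_ar`, the Gaussian innovations, the zero initial observations and the common scale `sigma` are all assumptions made for this example.

```python
import numpy as np

def simulate_ms_ar(Q, a, sigma, n, x0=0, seed=0):
    """Simulate Y_k = sum_i a[X_k, i-1] * Y_{k-i} + sigma * e_k, e_k ~ N(0, 1).

    Q     : (d, d) transition matrix of the hidden regime {X_k}
    a     : (d, s) regime-specific autoregressive coefficients (hypothetical)
    sigma : innovation scale, assumed common to all regimes
    Returns the regime path x (length n + 1) and the observations y
    (length n + s), where y[:s] holds the s initial observations (zeros).
    """
    rng = np.random.default_rng(seed)
    d, s = a.shape
    x = np.empty(n + 1, dtype=int)
    y = np.zeros(n + s)            # time t is stored at index t + s - 1
    x[0] = x0
    for k in range(1, n + 1):
        x[k] = rng.choice(d, p=Q[x[k - 1]])
        lags = y[k - 1:k - 1 + s][::-1]          # (Y_{k-1}, ..., Y_{k-s})
        y[k + s - 1] = a[x[k]] @ lags + sigma * rng.standard_normal()
    return x, y

Q = np.array([[0.9, 0.1],
              [0.2, 0.8]])
a = np.array([[0.5], [-0.3]])      # two regimes, AR order s = 1
x, y = simulate_ms_ar(Q, a, sigma=1.0, n=200, seed=1)
```

Only `y` would be available to the statistician; the regime path `x` is returned here purely for inspection.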

Nonlinear switching autoregressive models have recently been proposed in 
quantitative finance to model volatility of log-returns of international equity 
markets [see, e.g., Susmel (2000) and Chib, Nardari and Shephard (2002)]. 

A simple example of such a model (referred to as SWARCH for switching 
ARCH) is 

$$Y_n = f_\theta(\overline{\mathbf{Y}}_{n-1}, X_n)\, e_n,$$

where once again $\{X_k\}$ is either a discrete or a continuous Markov chain.
Another important subclass of autoregressive models with Markov regime are
the hidden Markov models (HMMs), for which the conditional distribution of
$Y_n$ does not depend on lagged $Y$'s but only on $X_n$. HMMs are used in many
different areas, including speech recognition [Juang and Rabiner (1991)],
neurophysiology [Fredkin and Rice (1987)], biology [Churchill (1989)],
econometrics [Chib, Nardari and Shephard (2002)] and time series analysis
[de Jong and Shephard (1995) and Chan and Ledolter (1995)]. See also the
monograph by MacDonald and Zucchini (1997) and references therein.

Most works on maximum likelihood estimation in such models have focused
on numerical methods suitable for approximating the maximum likelihood
estimator (MLE). In sharp contrast, statistical issues regarding asymptotic
properties of the MLE for autoregressive models with Markov regime
have been largely ignored until recently. Baum and Petrie (1966) proved
consistency and asymptotic normality of the MLE for HMMs in the particular
case where both the observed and the latent variables take values in
finite spaces. These results have recently been extended in a series of papers
by Leroux (1992), Bickel and Ritov (1996), Bickel, Ritov and Ryden (1998)
(henceforth referred to as BRR), Jensen and Petersen (1999) (henceforth
referred to as JP) and Bakry, Milhaud and Vandekerkhove (1997). BRR
followed the approach taken by Baum and Petrie (1966) and generalized their
results to the case where the hidden Markov chain $\{X_k\}$ takes a finite
number of values, but the observations belong to a general space. JP extended
these results to HMMs with the regime taking values in a compact space,
proving asymptotic normality of the MLE and a local consistency theorem.

Around the same time, Le Gland and Mevel (2000) [see also Mevel (1997)]
independently developed a different technique to prove consistency and
asymptotic normality of the MLE for HMMs with finite hidden state space. Their
work was later extended to HMMs with nonfinite hidden state space by
Douc and Matias (2001). This approach is based on the observation that
the log likelihood can be expressed as an additive function of an extended
Markov chain. These techniques, which are well adapted to study recursive
estimators (that are updated for each novel observation), typically require
stronger assumptions than the methods developed in BRR and JP.

None of the theoretical contributions mentioned so far allows for
autoregression; all are concerned with HMMs alone. For autoregressive
processes with Markov regime, the only theoretical result available up till now
is consistency of the MLE when the regime takes values in a finite set
[Krishnamurthy and Ryden (1998) and Francq and Roussignol (1998)]. In
the present paper we examine asymptotic properties of the MLE when the
hidden Markov chain takes values in a compact space, and we do allow for
autoregression in the observable process. Our results include consistency
and asymptotic normality of the MLE under standard regularity assumptions
(Theorems 1 and 4) and consistency of the observed information as an
estimator of the Fisher information (Theorem 3 with $\theta_n^*$ being the MLE).
These results generalize what is obtained in the above-mentioned papers to
a larger class of models, and we obtain them through a unified approach.
We also point out that the convergence theorem for the MLE is global, as
opposed to the local theorem of JP. Moreover, the nonstationary setting is
treated in Section 7.

The likelihood that we will work with is the conditional likelihood given
initial observations $\overline{\mathbf{Y}}_0 = (Y_0, \ldots, Y_{-s+1})$ and the
initial (but unobserved) state $X_0$. Conditioning on initial observations in
time series models goes back at least to Mann and Wald (1943). In our case we,
in addition, also condition the likelihood on the unobserved initial state.
The reason for doing so is that the stationary distribution of
$\{(X_k, \overline{\mathbf{Y}}_k)\}$, and hence the true likelihood, is
typically infeasible to compute. Thus $n$, denoting the number of factors in
the likelihood (the “nominal” sample size), is $s$ less than the actual sample
size. Using $p$ as a generic symbol for densities we can express the conditional
log likelihood as

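$$\begin{aligned}
\log p_\theta(y_1, \ldots, y_n \mid \overline{\mathbf{y}}_0, x_0)
&= \log \int p_\theta(x_1, \ldots, x_n, y_1, \ldots, y_n \mid \overline{\mathbf{y}}_0, x_0)\, \mu(dx_1) \cdots \mu(dx_n) \\
&= \log \int \prod_{k=1}^{n} q_\theta(x_{k-1}, x_k) \prod_{k=1}^{n} g_\theta(y_k \mid \overline{\mathbf{y}}_{k-1}, x_k)\, \mu(dx_1) \cdots \mu(dx_n),
\end{aligned} \tag{1}$$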


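where $\mu$ and $q_\theta(\cdot, \cdot)$ are a reference measure and the
transition density for the hidden chain, respectively, and
$g_\theta(y_k \mid \overline{\mathbf{y}}_{k-1}, x_k)$ is the conditional density
of $y_k$ given $\overline{\mathbf{y}}_{k-1}$ and $x_k$. In the particular case
when $\{X_k\}$ is finite-valued, taking values in $\{1, 2, \ldots, d\}$ say,
this log likelihood can be expressed as

$$\log p_\theta(y_1, \ldots, y_n \mid \overline{\mathbf{y}}_0, x_0)
= \log \Bigl\{ \mathbf{1}_{x_0}^{T} \Bigl( \prod_{k=1}^{n} Q_\theta\, G_\theta(y_k \mid \overline{\mathbf{y}}_{k-1}) \Bigr) \mathbf{1} \Bigr\}, \tag{2}$$

where $Q_\theta = \{q_\theta(i, j)\}$ is the transition probability matrix of the
Markov chain $\{X_k\}$,
$G_\theta(y \mid \overline{\mathbf{y}}) = \mathrm{diag}(g_\theta(y \mid \overline{\mathbf{y}}, i))$,
$\mathbf{1}_{x_0}$ is the $x_0$th unit vector of length $d$, that is, a
$d \times 1$ vector in which all elements are zero except for element
$x_0$ which is unity, and $\mathbf{1}$ is a $d \times 1$ vector of all ones.
It is clear that (2)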
is essentially a product of matrices and is hence easily evaluated. It can be
maximized over $\theta$ using standard numerical optimization procedures or
using the EM algorithm [see, e.g., Hamilton (1990)]. However, one should be
aware that the log likelihood is typically multi-modal and either approach may
converge to a local maximum. When $\{X_k\}$ is continuous, evaluation of the
log likelihood (1) requires an integration over an $n$-dimensional space. This
task is insurmountable for typical values of $n$, and approximation methods
are required. Two classes of such methods, particle filters and Monte Carlo
EM algorithms, as well as a numerical example using the latter, are briefly
discussed in Section 8.
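
For a finite-valued regime, the matrix-product form of the conditional log likelihood can be evaluated in a few lines. The sketch below is only an illustration under stated assumptions: the name `loglik_fixed_x0` is hypothetical, states are indexed from 0 rather than 1, the conditional densities are assumed to be precomputed into an array `emis`, and the running vector is rescaled at each step to avoid numerical underflow (a standard device that is not discussed in the text).

```python
import numpy as np

def loglik_fixed_x0(Q, emis, x0):
    """Evaluate log( 1_{x0}^T [prod_k Q G_k] 1 ) with G_k = diag(emis[k-1]).

    Q    : (d, d) transition probability matrix of the regime
    emis : (n, d) array with emis[k-1, i] = g(y_k | ybar_{k-1}, i)
    x0   : fixed initial state, an integer in {0, ..., d-1}
    """
    row = np.zeros(Q.shape[0])
    row[x0] = 1.0                      # the unit row vector 1_{x0}^T
    loglik = 0.0
    for g in emis:
        row = (row @ Q) * g            # right-multiply by Q G_k
        scale = row.sum()              # running normalising constant
        loglik += np.log(scale)        # accumulate on the log scale
        row = row / scale              # rescale so the product never underflows
    return loglik
```

The cost is of order $n d^2$, in contrast to the $d^n$ terms of the sum over all regime paths. Maximizing this function over the parameters would still require a numerical optimizer or EM, as noted above.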

An obvious variant to our approach is to replace the condition of a fixed
$x_0$ by assuming a fixed distribution for $x_0$. Such an assumption does not
change any of our results and no more than notational changes are needed
in the proofs. A further natural variant is to maximize (1) w.r.t. $\theta$ and
the unknown $x_0$. We have not included this approach in the present paper,
primarily because score function analysis would require assumptions on how
the maximizing $x_0$ varies with $\theta$, assumptions that would be difficult
to verify in practice. We do remark, however, that in a particular but important
case, assuming a fixed $x_0$ is no less general than is maximization over $x_0$.
Suppose that the regime $\{X_k\}$ is finite-valued and that all elements
$q_{ij}$ of the transition probability matrix $Q$ may be chosen independently.
The parameter vector $\theta$ may then be written $\theta = ((q_{ij}), \psi)$.
We also assume that $\psi$ can be further decomposed as
$\psi = (\alpha, (\beta_i))$, and that the functions $g$ are such that
$g_\theta(y_k \mid \overline{\mathbf{y}}_{k-1}, x_k) = h(y_k \mid \overline{\mathbf{y}}_{k-1}; \alpha, \beta_{x_k})$
for some family of densities $h$.
In other words, all $g$'s belong to a single parametric class of densities,
$\alpha$ is a parameter common to all regimes and the $\beta_i$'s are the regime
specific parameters. For example, in the linear regression case $\beta_i$ may be
the regime specific regression coefficients while $\alpha$ may be a common
innovation variance, $\alpha = \mathrm{E}\, e_k^2$. With this general structure
it is clear that if $x_0$ is a fixed initial state, for any model with a
different initial state we can find an equivalent model with initial state
$x_0$ by simply renumbering the states and then reordering the $q_{ij}$'s and
$\beta_i$'s accordingly. Therefore, whenever $\theta$ is structured
as above, assuming a fixed $x_0$ is no less general than is maximization over
$x_0$.

As mentioned above, from a practical point of view the novelty of the
present paper is that we extend the analysis of MLE asymptotics to a wider
class of models using a unified approach. From a theoretical point of view the
novelty is, foremost, the geometrically decaying bound on the mixing rate of
the conditional chain, $X \mid Y$, given in Corollary 1 and (20). This bound
parallels results of BRR (page 1622) and JP (page 521), but in contrast to those
results our bound does not depend on the $Y$'s being conditioned upon; it is
deterministic. Assumption (A1)(a) below, implying that the hidden chain is
uniformly geometrically ergodic, and more specifically that the whole state
space is 1-small [see the comment after (A1)(a)], is crucial to this property;
if $\{X_k\}$ is $m$-small with $m > 1$ one can prove an analogous mixing rate
bound using similar ideas, but the bound will then depend on the $Y$'s. The
deterministic nature of the bound is vital to our proofs that the conditional
score given the “infinite past” [$\Delta_{k,\infty}(\theta^*)$ in Section 6.1]
and the conditional Hessian given the “infinite past” (cf. Propositions 4 and 5)
have finite second and first moments, respectively. The reason is that when the
model contains autoregression, the conditional distribution of $\{Y_k\}$ given
$\{X_k\}$ is governed by an inhomogeneous autoregression rather than by
independence; hence, in the proof of Lemma 10, for example, we cannot condition
on the regime $\{X_k\}$ and exploit conditional independence in order to turn a
random mixing bound into a deterministic one as was done in BRR (e.g., page 1625)
and JP (e.g., page 525). We plan to look into this more general case, but it
lies outside the scope of the present paper. Another feature of the present
paper is that by refining the arguments of BRR and JP we obtain almost
sure convergence rather than convergence in probability in Theorem 3.

The paper is organized as follows. Main assumptions are given and commented
on in Section 2, together with common notation. Then in Section 3
we show that the regime $\{X_k\}$, given the observations, is a nonhomogeneous
Markov chain whose transition kernels may be minorized using a fixed
and common minorizing constant. This leads to a deterministic bound for
its mixing rate. In Section 4, consistency of the MLE is considered under
the additional assumption that $\{Y_k\}$ is strict sense stationary; extensions
to nonstationary processes through coupling are carried out in Section 7.
Conditions upon which the parameters are identifiable are given in Section 5.
Asymptotic normality of the estimator is studied in Section 6. The
proof is based on a central limit theorem and a locally uniform law of large
numbers for the conditional expectation of appropriately defined statistics.
More specifically, these statistics are additive and quadratic functionals of
the complete data. Section 8 contains a discussion of numerical methods for
state space models and a numerical example. Finally, the Appendix contains
proofs not given in the main text.


2. Notation and assumptions. We assume that the Markov chain
$\{X_k\}_{k=0}^{\infty}$ is homogeneous and lies in a separable and compact set
$\mathcal{X}$, equipped with a metrizable topology and the associated Borel
$\sigma$-field $\mathcal{B}(\mathcal{X})$. We let $Q_\theta(x, A)$,
$x \in \mathcal{X}$, $A \in \mathcal{B}(\mathcal{X})$, be the transition kernel
of the chain; the parameter $\theta$ which indexes the family of transition
kernels as well as the regression functions for the $Y$'s, see below, is the
parameter that we want to estimate. Next we assume that each measure
$Q_\theta(x, \cdot)$ has a density $q_\theta(x, \cdot)$ with respect to
a common finite dominating measure $\mu$ on $\mathcal{X}$. That is, for all
$\theta$ and $x \in \mathcal{X}$, $Q_\theta(x, \cdot) \ll \mu$. For the sake of
simplicity, it is assumed that $\mu(\mathcal{X}) = 1$; this
assumption hints at applications where $\mathcal{X}$ is a totally bounded space.

We also assume that the observable sequence $\{Y_k\}_{k=-s+1}^{\infty}$ takes
values in a set $\mathcal{Y}$ that is separable and metrizable by a complete
metric. Furthermore, for each $n \ge 1$ and given $\{Y_k\}_{k=n-s}^{n-1}$ and
$X_n$, $Y_n$ is conditionally independent of $\{Y_k\}_{k=-s+1}^{n-s-1}$ and
$\{X_k\}_{k=0}^{n-1}$. We also assume that for each $X_n$,
$\overline{\mathbf{Y}}_{n-1}$ and $\theta$, this conditional law has a density
$g_\theta(y \mid \overline{\mathbf{Y}}_{n-1}, X_n)$ with respect to some
fixed $\sigma$-finite measure $\nu$ on the Borel $\sigma$-field
$\mathcal{B}(\mathcal{Y})$.

The parameter $\theta$ belongs to $\Theta$, a compact subset of $\mathbb{R}^p$.
The true parameter value will be denoted by $\theta^*$, and when proving
asymptotic normality of the MLE we assume that $\theta^*$ lies in the interior
of $\Theta$. Given the observations $Y_{-s+1}, \ldots, Y_n$ of the process
$\{Y_k\}$, we wish to estimate $\theta^*$ by the maximum likelihood method.

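The sequence $\{Z_k\}_{k=0}^{\infty} = \{(X_k, \overline{\mathbf{Y}}_k)\}_{k=0}^{\infty}$
is a Markov chain on $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}^s$
with transition kernel $\Pi_\theta$ given by, for any bounded measurable
function $f$ on $\mathcal{Z}$,

$$\Pi_\theta f(x, y_s, y_{s-1}, \ldots, y_1)
= \int_{\mathcal{X} \times \mathcal{Y}} f(x', y', y_s, \ldots, y_2)\, q_\theta(x, x')\, g_\theta(y' \mid y_s, \ldots, y_1, x')\, \mu(dx')\, \nu(dy').$$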

We use in the sequel the canonical version of this Markov chain and put
$\bar{\nu} = \nu^{\otimes s}$. For a probability measure $\zeta$ on $\mathcal{Z}$
we let $P_{\theta,\zeta}$ be the law of $\{Z_n\}$ when the initial distribution
is $\zeta$; that is, $Z_0 \sim \zeta$. Furthermore, $E_{\theta,\zeta}$ is the
associated expectation. Many conditional probabilities and expectations in
this paper do not depend on the initial distribution, and we stress this by
then dropping the initial probability measure from the notation, so that
$P_{\theta,\zeta}$ is replaced by $P_\theta$, and so on.

Throughout this paper we will assume that the transition kernel $\Pi_\theta$ has
a unique invariant distribution $\pi_\theta$; this assumption is further
commented on below. For a stationary process we write $\bar{P}_\theta$ and
$\bar{E}_\theta$ for $P_{\theta,\pi_\theta}$ and $E_{\theta,\pi_\theta}$,
respectively. We can and will extend such a stationary process
$\{Z_k\}_{k=0}^{\infty}$ to a stationary Markov chain
$\{Z_k\}_{k=-\infty}^{\infty}$ with doubly infinite time and the same
transition kernel.

For $i \le j$, put $\mathbf{Y}_i^j = (Y_i, Y_{i+1}, \ldots, Y_j)$ and
$\overline{\mathbf{Y}}_i^j = (\overline{\mathbf{Y}}_i, \overline{\mathbf{Y}}_{i+1}, \ldots, \overline{\mathbf{Y}}_j)$,
respectively. Similar notation will be used for other quantities. For any
measurable function $f$ on $(\mathcal{X}, \mathcal{B}(\mathcal{X}), \mu)$,
$\operatorname{ess\,sup} f = \inf\{M \ge 0 : \mu(\{M < |f|\}) = 0\}$
and, if $f$ is nonnegative,
$\operatorname{ess\,inf} f = \sup\{M \ge 0 : \mu(\{M > f\}) = 0\}$ (with obvious
conventions if those sets are empty). For the sake of simplicity, instead of
writing $\operatorname{ess\,sup} f$ or $\operatorname{ess\,inf} f$, we use the
notation $\sup f$ or $\inf f$.

For any two probability measures $\mu_1$ and $\mu_2$ we define the total
variation distance
$\|\mu_1 - \mu_2\|_{\mathrm{TV}} = \sup_A |\mu_1(A) - \mu_2(A)|$ and we also
recall the identities
$\sup_{|f| \le 1} |\mu_1(f) - \mu_2(f)| = 2 \|\mu_1 - \mu_2\|_{\mathrm{TV}}$ and
$\sup_{0 \le f \le 1} |\mu_1(f) - \mu_2(f)| = \|\mu_1 - \mu_2\|_{\mathrm{TV}}$.
For any matrix or vector $A$, $\|A\| = \sum |A_{ij}|$. Finally, we will use
the letter $p$ to denote densities w.r.t. the probability measure on
$\mathcal{B}(\mathcal{X} \times \mathcal{Y})^{\otimes \mathbb{Z}}$
whose finite-dimensional distributions are given by
$(\mu \otimes \nu)^{\otimes r}$ for all $r \ge 1$.

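We now list our basic assumptions.

(A1) (a) $0 < \sigma_- := \inf_{\theta \in \Theta} \inf_{x, x' \in \mathcal{X}} q_\theta(x, x')$
and $\sigma_+ := \sup_{\theta \in \Theta} \sup_{x, x' \in \mathcal{X}} q_\theta(x, x') < \infty$.

(b) For all $y' \in \mathcal{Y}$ and $\overline{\mathbf{y}} \in \mathcal{Y}^s$,
$0 < \inf_{\theta \in \Theta} \int_{\mathcal{X}} g_\theta(y' \mid \overline{\mathbf{y}}, x)\, \mu(dx)$ and
$\sup_{\theta \in \Theta} \int_{\mathcal{X}} g_\theta(y' \mid \overline{\mathbf{y}}, x)\, \mu(dx) < \infty$.

Assumption (A1)(a) implies that for all $x \in \mathcal{X}$,
$Q_\theta(x, A) \ge \sigma_- \mu(A)$ where $\mu$ is a probability measure, that
is, the state space $\mathcal{X}$ of the Markov chain $\{X_n\}$
is 1-small [Meyn and Tweedie (1993), page 106, with $m = 1$]. Thus, for all
$\theta \in \Theta$, this chain has a unique invariant measure $\pi_\theta^X$
and is uniformly ergodic [Meyn and Tweedie (1993), Theorem 16.0.2(v)]. When the
state space is finite, (A1)(a) is equivalent to saying that for all
$x, x' \in \mathcal{X}$, $\inf_{\theta \in \Theta} q_\theta(x, x') > 0$.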
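
As a small numerical illustration of (A1)(a) for a finite state space and a single parameter value, the sketch below takes $\mu$ to be the uniform probability measure on $d$ points, so that the density of $Q(x, \cdot)$ with respect to $\mu$ is $d$ times the transition probability. The matrix `Q` and the helper names are hypothetical choices made for this example. The check uses the standard consequence of the minorization $Q(x, A) \ge \sigma_- \mu(A)$ that the $n$-step law is within $(1 - \sigma_-)^n$ of the invariant law in total variation.

```python
import numpy as np

def minorization_constants(Q):
    """Return (sigma_minus, sigma_plus) for a finite chain, mu uniform."""
    d = Q.shape[0]
    q = d * Q                      # density of Q(x, .) w.r.t. uniform mu
    return q.min(), q.max()

def tv(p, r):
    """Total variation distance sup_A |p(A) - r(A)| between two pmfs."""
    return 0.5 * np.abs(p - r).sum()

Q = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
sigma_minus, sigma_plus = minorization_constants(Q)

# Invariant law: left eigenvector of Q for the eigenvalue 1.
w, v = np.linalg.eig(Q.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi = pi / pi.sum()

# Worst-case distance to stationarity after n steps, against (1 - sigma_-)^n.
Qn = np.eye(3)
worst = []
for n in range(1, 11):
    Qn = Qn @ Q
    worst.append(max(tv(Qn[x], pi) for x in range(3)))
    assert worst[-1] <= (1.0 - sigma_minus) ** n + 1e-12
```

Note that $\sigma_+ \ge 1$ always holds under the normalization $\mu(\mathcal{X}) = 1$, since each density $q(x, \cdot)$ integrates to one.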

(A2) For all $\theta \in \Theta$, the transition kernel $\Pi_\theta$ is positive
Harris recurrent and aperiodic with invariant distribution $\pi_\theta$.

That the chain is positive means, essentially, that it is irreducible and has
an invariant distribution [Meyn and Tweedie (1993), page 230] and Harris
recurrence means that any nonnull set will be infinitely visited by the chain
irrespective of where it starts within the set [Meyn and Tweedie (1993),
page 200]. This assumption is rather weak; results on ergodicity for
autoregressive processes with Markov regime can be found in, for example,
Francq and Zakoian (2001), Holst, Lindgren, Holst and Thuvesholmen [(1994),
page 495] and Yao and Attali (2000). It implies that for any initial measure
$\lambda$ [see Meyn and Tweedie (1993), Theorem 13.3.3],

$$\lim_{n \to \infty} \|\lambda \Pi_\theta^n - \pi_\theta\|_{\mathrm{TV}} = 0, \tag{3}$$

so that the tail $\sigma$-field of $\{Z_k\}$ is trivial [Lindvall (1992),
Theorem III.21.12]. Its invariant $\sigma$-field, which is no larger, is thus
also trivial and hence $\{Z_k\}$ is ergodic in the measure-theoretic sense of
the word.

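For the developments that follow, an additional assumption is needed.

(A3) $b_+ := \sup_\theta \sup_{\overline{\mathbf{y}}_0, y_1, x} g_\theta(y_1 \mid \overline{\mathbf{y}}_0, x) < \infty$
and $\bar{E}_{\theta^*}(|\log b_-(\overline{\mathbf{Y}}_0, Y_1)|) < \infty$,
where $b_-(\overline{\mathbf{y}}_0, y_1) := \inf_\theta \int_{\mathcal{X}} g_\theta(y_1 \mid \overline{\mathbf{y}}_0, x)\, \mu(dx)$.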

Remark 1. In the sequel we consider conditional expectations of random
variables w.r.t. the $\sigma$-algebra generated by
$(\mathbf{X}_m^n, \overline{\mathbf{Y}}_m^n)$ for some $m \le n$.
Such expectations are defined up to a $P_{\theta,\zeta}$-null set. For the
derivations that follow, we need to specify a version of these conditional
expectations. Since $P_{\theta,\zeta}$ is defined by the initial distribution
$\zeta$ and the transition kernel $\Pi_\theta$, it is always possible to express
these conditional expectations in terms of these quantities and we always
implicitly choose this version of the conditional expectations.

3. Uniform forgetting of the conditional hidden Markov chain. By the
conditional hidden Markov chain we mean the process $\{X_k\}$ given a
sequence of $Y$'s. It will turn out that this process is a Markov chain,
although nonhomogeneous, but still having a favorable mixing rate. Bounds on
this mixing rate will be instrumental in the forthcoming development.

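Lemma 1. Assume (A1). Let $m, n \in \mathbb{Z}$ with $m \le n$ and
$\theta \in \Theta$. Under $P_\theta$, conditionally on
$\overline{\mathbf{Y}}_m^n$, $\{X_k\}_{k \ge m}$ is an inhomogeneous Markov
chain, and for all $k > m$ there exists a function
$\mu_k(\mathbf{y}_{k-s}^n, A)$ such that:

(i) for any $A \in \mathcal{B}(\mathcal{X})$,
$\mathbf{y}_{k-s}^n \mapsto \mu_k(\mathbf{y}_{k-s}^n, A)$ is a Borel function;

(ii) for any $\mathbf{y}_{k-s}^n$, $\mu_k(\mathbf{y}_{k-s}^n, \cdot)$ is a
probability measure on $\mathcal{B}(\mathcal{X})$. In addition, for all
$\mathbf{y}_{k-s}^n$ it holds that $\mu_k(\mathbf{y}_{k-s}^n, \cdot) \ll \mu$
and for all $\overline{\mathbf{Y}}_m^n$,

$$\inf_{x \in \mathcal{X}} P_\theta(X_k \in A \mid X_{k-1} = x, \overline{\mathbf{Y}}_m^n) \ge \frac{\sigma_-}{\sigma_+}\, \mu_k(\mathbf{Y}_{k-s}^n, A).$$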

Remark 2. Contrary to JP, this minorization condition involves a constant
$\sigma_-/\sigma_+$ which does not depend on the values of $\{Y_k\}$. On the
other hand, the minorizing measure $\mu_k(\mathbf{y}_{k-s}^n, \cdot)$ does
depend on $\mathbf{y}_{k-s}^n$ whereas the minorizing measure is fixed in JP.
Hence no assumption on the conditional density of $Y_k$ given past observations
and hidden state variables is needed, whereas JP assumed a moment condition, in
the special case of HMMs, for the ratio
$\sup_\theta \sup_{x, x'} g_\theta(y \mid x) / g_\theta(y \mid x')$. An explicit
expression for $\mu_k(\mathbf{y}_{k-s}^n, \cdot)$ is not needed.
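Proof of Lemma 1. The proof is adapted from Del Moral and Guionnet
(2001) [see also Del Moral and Miclo (2000)]. The Markov property implies
that, for $m < k \le n$,

$$P_\theta(X_k \in A \mid \mathbf{X}_m^{k-1}, \overline{\mathbf{Y}}_m^n) = P_\theta(X_k \in A \mid X_{k-1}, \overline{\mathbf{Y}}_{k-1}^n).$$

For $k > n$ we have
$P_\theta(X_k \in A \mid \mathbf{X}_m^{k-1}, \overline{\mathbf{Y}}_m^n) = Q_\theta(X_{k-1}, A)$.
This shows that $\{X_k\}_{k \ge m}$ conditional on $\overline{\mathbf{Y}}_m^n$
is an inhomogeneous Markov chain. For $k \le n$ it holds that

$$\begin{aligned}
P_\theta(X_k \in A \mid X_{k-1}, \overline{\mathbf{Y}}_{k-1}^n)
&= \int_A q_\theta(X_{k-1}, x)\, p_\theta(\mathbf{Y}_k^n \mid X_k = x, \overline{\mathbf{Y}}_{k-1})\, \mu(dx) \\
&\quad \times \Bigl( \int_{\mathcal{X}} q_\theta(X_{k-1}, x)\, p_\theta(\mathbf{Y}_k^n \mid X_k = x, \overline{\mathbf{Y}}_{k-1})\, \mu(dx) \Bigr)^{-1},
\end{aligned}$$

where

$$p_\theta(\mathbf{Y}_k^n \mid X_k = x_k, \overline{\mathbf{Y}}_{k-1})
= \int \prod_{i=k+1}^{n} q_\theta(x_{i-1}, x_i) \prod_{i=k}^{n} g_\theta(Y_i \mid \overline{\mathbf{Y}}_{i-1}, x_i)\, \mu^{\otimes (n-k)}(d\mathbf{x}_{k+1}^n). \tag{4}$$

Since $\sigma_- \le q_\theta(x, x') \le \sigma_+$ it readily follows that

$$P_\theta(X_k \in A \mid X_{k-1}, \overline{\mathbf{Y}}_{k-1}^n) \ge \frac{\sigma_-}{\sigma_+}\, \mu_k(\mathbf{Y}_{k-s}^n, A)$$

with

$$\mu_k(\mathbf{Y}_{k-s}^n, A) := \int_A p_\theta(\mathbf{Y}_k^n \mid X_k = x, \overline{\mathbf{Y}}_{k-1})\, \mu(dx) \Bigm/ \int_{\mathcal{X}} p_\theta(\mathbf{Y}_k^n \mid X_k = x, \overline{\mathbf{Y}}_{k-1})\, \mu(dx).$$

Note that

$$\begin{aligned}
\int_{\mathcal{X}} p_\theta(\mathbf{Y}_k^n \mid X_k = x_k, \overline{\mathbf{Y}}_{k-1})\, \mu(dx_k)
&= \int \prod_{i=k+1}^{n} q_\theta(x_{i-1}, x_i) \prod_{i=k}^{n} g_\theta(Y_i \mid \overline{\mathbf{Y}}_{i-1}, x_i)\, \mu^{\otimes (n-k+1)}(d\mathbf{x}_k^n) \\
&\ge \sigma_-^{\,n-k} \prod_{i=k}^{n} \int_{\mathcal{X}} g_\theta(Y_i \mid \overline{\mathbf{Y}}_{i-1}, x)\, \mu(dx)
\end{aligned}$$

is positive under (A1)(b). For $k > n$ we simply set
$\mu_k(\mathbf{Y}_{k-s}^n, A) = \mu(A)$. □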


The a posteriori chain thus also admits $\mathcal{X}$ as a 1-small set. It is
worthwhile to note that, despite the chain being nonhomogeneous, the same
minorizing constant can be used for all kernels, irrespective of the $Y$'s the
chain is conditioned upon and of the parameter value. Using standard results for
uniformly minorized Markov chains [see, e.g., Lindvall (1992), Sections
III.9-11], we thus have the following result, which plays a key role in the
sequel.

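Corollary 1. Assume (A1). Let $m, n \in \mathbb{Z}$ with $m \le n$ and
$\theta \in \Theta$. Then for all $k \ge m$, all probability measures $\mu_1$
and $\mu_2$ on $\mathcal{B}(\mathcal{X})$ and all $\overline{\mathbf{Y}}_m^n$,

$$\Bigl\| \int_{\mathcal{X}} P_\theta(X_k \in \cdot \mid X_m = x, \overline{\mathbf{Y}}_m^n)\, \mu_1(dx) - \int_{\mathcal{X}} P_\theta(X_k \in \cdot \mid X_m = x, \overline{\mathbf{Y}}_m^n)\, \mu_2(dx) \Bigr\|_{\mathrm{TV}} \le \rho^{k-m},$$

where $\rho := 1 - \sigma_-/\sigma_+$.

Note that when $m$ is positive,
$P_{\theta,\zeta}(X_k \in \cdot \mid X_m = x, \overline{\mathbf{Y}}_m^n) = P_\theta(X_k \in \cdot \mid X_m = x, \overline{\mathbf{Y}}_m^n)$
does not depend upon the initial distribution.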
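
The bound of Corollary 1 can be observed numerically in a finite-state example. The sketch below is an illustration only: it assumes a two-state switching AR(1) with standard normal innovations, hypothetical slopes `a`, and takes $m = 0$ and $n = k$, so that the conditional law of $X_k$ is the filter started from a given initial law. With a finite state space the ratio $\sigma_-/\sigma_+$ equals the ratio of the smallest to the largest transition probability, whatever the normalization of $\mu$.

```python
import numpy as np

def filter_path(Q, emis, init):
    """Filtering laws P(X_k = . | X_0 ~ init, Y_1, ..., Y_k) for k = 1..n."""
    alpha = np.asarray(init, dtype=float)
    out = []
    for g in emis:
        alpha = (alpha @ Q) * g        # predict one step, then weight by g
        alpha = alpha / alpha.sum()    # normalise to a probability vector
        out.append(alpha)
    return np.array(out)

rng = np.random.default_rng(0)
Q = np.array([[0.8, 0.2],
              [0.3, 0.7]])
a = np.array([0.5, -0.5])              # hypothetical regime-specific AR(1) slopes
n = 30

# Simulate a switching AR(1) path with standard normal innovations.
x, y = 0, np.zeros(n + 1)
for k in range(1, n + 1):
    x = rng.choice(2, p=Q[x])
    y[k] = a[x] * y[k - 1] + rng.standard_normal()

# emis[k-1, i] = g(y_k | y_{k-1}, i), the N(a_i * y_{k-1}, 1) density.
resid = y[1:, None] - a[None, :] * y[:-1, None]
emis = np.exp(-0.5 * resid ** 2) / np.sqrt(2.0 * np.pi)

rho = 1.0 - Q.min() / Q.max()          # rho = 1 - sigma_- / sigma_+
f0 = filter_path(Q, emis, [1.0, 0.0])  # chain started in state 0
f1 = filter_path(Q, emis, [0.0, 1.0])  # chain started in state 1
tv = 0.5 * np.abs(f0 - f1).sum(axis=1) # total variation at k = 1, ..., n
bound = rho ** np.arange(1, n + 1)
assert np.all(tv <= bound + 1e-12)
```

The bound depends on the observations only through their number, which is the deterministic feature stressed in Remark 2.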


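4. Uniform convergence of the likelihood contrast function. Given
$x_0 \in \mathcal{X}$, notice that

$$p_\theta(\mathbf{Y}_1^n \mid \overline{\mathbf{Y}}_0, X_0 = x_0) = \int \prod_{k=1}^{n} q_\theta(x_{k-1}, x_k)\, g_\theta(Y_k \mid \overline{\mathbf{Y}}_{k-1}, x_k)\, \mu^{\otimes n}(d\mathbf{x}_1^n) \tag{5}$$

and define the conditional log likelihood function

$$l_n(\theta, x_0) := \log p_\theta(\mathbf{Y}_1^n \mid \overline{\mathbf{Y}}_0, X_0 = x_0) = \sum_{k=1}^{n} \log p_\theta(Y_k \mid \overline{\mathbf{Y}}_0^{k-1}, X_0 = x_0), \tag{6}$$

where
$p_\theta(Y_k \mid \overline{\mathbf{Y}}_0^{k-1}, X_0 = x_0) = p_\theta(\mathbf{Y}_1^k \mid \overline{\mathbf{Y}}_0, X_0 = x_0) / p_\theta(\mathbf{Y}_1^{k-1} \mid \overline{\mathbf{Y}}_0, X_0 = x_0)$.
With the notation introduced above, for $k \ge 1$,

$$p_\theta(Y_k \mid \overline{\mathbf{Y}}_0^{k-1}, X_0 = x_0) = \iint g_\theta(Y_k \mid \overline{\mathbf{Y}}_{k-1}, x_k)\, q_\theta(x_{k-1}, x_k)\, P_\theta(X_{k-1} \in dx_{k-1} \mid \overline{\mathbf{Y}}_0^{k-1}, X_0 = x_0)\, \mu(dx_k); \tag{7}$$

here $P_\theta(X_{k-1} \in \cdot \mid \overline{\mathbf{Y}}_0^{k-1}, X_0 = x_0)$ is
the filtering distribution of the unknown state $X_{k-1}$ given
$\overline{\mathbf{Y}}_0^{k-1}$ and $X_0 = x_0$. Note that this distribution may
be expressed as

$$P_\theta(X_{k-1} \in \cdot \mid \overline{\mathbf{Y}}_0^{k-1}, X_0 = x) = \int P_\theta(X_{k-1} \in \cdot \mid \overline{\mathbf{Y}}_0^{k-1}, X_0 = x_0)\, \delta_x(dx_0). \tag{8}$$

The discussion in the previous section hints that the influence of the initial
point $X_0$ vanishes as $k \to \infty$.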
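The definition of the conditional log likelihood employed here differs from
the one usually adopted for HMMs. Extending to AR models with Markov
regime the definitions of BRR and JP for example, the log likelihood would
be

$$l_n(\theta) = \sum_{k=1}^{n} \log \bar{p}_\theta(Y_k \mid \overline{\mathbf{Y}}_0^{k-1}), \tag{9}$$

where

$$\bar{p}_\theta(Y_k \mid \overline{\mathbf{Y}}_0^{k-1}) = \iint g_\theta(Y_k \mid \overline{\mathbf{Y}}_{k-1}, x_k)\, q_\theta(x_{k-1}, x_k)\, \bar{P}_\theta(X_{k-1} \in dx_{k-1} \mid \overline{\mathbf{Y}}_0^{k-1})\, \mu(dx_k). \tag{10}$$

Here $\bar{P}_\theta(X_{k-1} \in \cdot \mid \overline{\mathbf{Y}}_0^{k-1})$ is the
filtering distribution of the unknown state $X_{k-1}$ given
$\overline{\mathbf{Y}}_0^{k-1}$ under the stationary probability
$\bar{P}_\theta$. This filtering distribution may be expressed as

$$\bar{P}_\theta(X_{k-1} \in \cdot \mid \overline{\mathbf{Y}}_0^{k-1}) = \int P_\theta(X_{k-1} \in \cdot \mid \overline{\mathbf{Y}}_0^{k-1}, X_0 = x_0)\, \bar{P}_\theta(X_0 \in dx_0 \mid \overline{\mathbf{Y}}_0^{k-1}) \tag{11}$$

and Corollary 1 shows that the total variation distance between the
filtering probabilities
$\bar{P}_\theta(X_{k-1} \in \cdot \mid \overline{\mathbf{Y}}_0^{k-1})$ and
$P_\theta(X_{k-1} \in \cdot \mid \overline{\mathbf{Y}}_0^{k-1}, X_0 = x_0)$
tends to zero exponentially fast as $k \to \infty$ uniformly w.r.t. $x_0$.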

The definition of the log likelihood in (9) is useful for HMMs but less so for
models with autoregression. Indeed, for many models
$\bar{p}_\theta(Y_k \mid \overline{\mathbf{Y}}_0^{k-1})$ cannot
be expressed in closed form, basically because the smoothing probability
$\bar{P}_\theta(X_0 \in \cdot \mid \overline{\mathbf{Y}}_0^{k-1})$ depends upon
the stationary distribution $\pi_\theta$ of the complete chain,

$$\bar{P}_\theta(X_0 \in A \mid \overline{\mathbf{Y}}_0^{k-1})
= \frac{\int_A \pi_\theta(dx_0 \mid \overline{\mathbf{Y}}_0) \int \prod_{i=1}^{k-1} q_\theta(x_{i-1}, x_i)\, g_\theta(Y_i \mid \overline{\mathbf{Y}}_{i-1}, x_i)\, \mu^{\otimes (k-1)}(d\mathbf{x}_1^{k-1})}
{\int_{\mathcal{X}} \pi_\theta(dx_0 \mid \overline{\mathbf{Y}}_0) \int \prod_{i=1}^{k-1} q_\theta(x_{i-1}, x_i)\, g_\theta(Y_i \mid \overline{\mathbf{Y}}_{i-1}, x_i)\, \mu^{\otimes (k-1)}(d\mathbf{x}_1^{k-1})}.$$

In many models for which the stationary density is not available in closed
form, the log likelihood (9) does not lead to a practical algorithm. This
is our motivation for considering the conditional form (6) of the log
likelihood function. Nevertheless, as we will see below, for any initial point
$x_0$, $n^{-1}(l_n(\theta, x_0) - l_n(\theta))$ converges to zero uniformly
w.r.t. $\theta \in \Theta$ as a consequence of the uniform forgetting of the
conditional Markov chain. Thus, by the continuity of the argmax functional,
$\hat{\theta}_{n,x_0}$, the maximizer of $l_n(\theta, x_0)$, and
$\hat{\theta}_n$, the maximizer of $l_n(\theta)$, are asymptotically equivalent.





R. DOUC, E. MOULINES AND T. RYDEN 


Remark 3. For $\xi$ an arbitrary probability measure on
$\mathcal{B}(\mathcal{X})$ it is possible to consider

$$p_{\theta,\xi}(\mathbf{Y}_1^n \mid \overline{\mathbf{Y}}_0) = \int p_\theta(\mathbf{Y}_1^n \mid \overline{\mathbf{Y}}_0, X_0 = x_0)\, \xi(dx_0).$$

That is, instead of choosing an initial point $X_0 = x_0$ we set an arbitrary
initial distribution. There is in general little rationale for doing that,
but the results obtained below for a fixed initial condition $X_0 = x_0$
immediately carry over to this more general context. Typically such a $\xi$ has
a density w.r.t. $\mu$ so that there are a density
$p_{\theta,\xi}(\mathbf{Y}_1^n \mid \overline{\mathbf{Y}}_0, X_0 = x_0)$ and an
associated MLE.

The consistency proof for the MLE follows the classical scheme of Wald
(1949), which amounts to proving that there exists a deterministic asymptotic
criterion function $l(\theta)$ such that
$n^{-1} l_n(\theta, x_0) \to l(\theta)$ $\bar{P}_{\theta^*}$-a.s. uniformly
w.r.t. $\theta \in \Theta$ and that $\theta^*$ is a well-separated point of
maximum of $l(\theta)$. It should be stressed that the asymptotic criterion
$l(\theta)$ should of course not depend on the initial point $X_0 = x_0$.

The first step of the proof thus consists in showing that the normalized
log likelihood function $n^{-1} l_n(\theta, x_0)$ converges to $l(\theta)$
uniformly w.r.t. $\theta$. This requires a uniform (w.r.t. $\theta \in \Theta$
and $x_0 \in \mathcal{X}$) law of large numbers. We first show that the
difference between the conditional log likelihood functions
$l_n(\theta, x_0)$ and $l_n(\theta)$ stays within some deterministic bound,
and hence $n^{-1}(l_n(\theta, x_0) - l_n(\theta))$ tends to zero
$\bar{P}_{\theta^*}$-a.s. and in $L^1(\bar{P}_{\theta^*})$ [see
Del Moral and Miclo (2001) for similar results].

Lemma 2. Assume (A1) and (A2). Then, for all $x_0 \in \mathcal{X}$,

$$\sup_{\theta \in \Theta} |l_n(\theta, x_0) - l_n(\theta)| \le 1/(1 - \rho)^2, \qquad \bar{P}_{\theta^*}\text{-a.s.}$$
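Proof. Note that by Corollary 1, (8) and (11),

$$\| P_\theta(X_{k-1} \in \cdot \mid \overline{\mathbf{Y}}_0^{k-1}, X_0 = x_0) - \bar{P}_\theta(X_{k-1} \in \cdot \mid \overline{\mathbf{Y}}_0^{k-1}) \|_{\mathrm{TV}} \le \rho^{k-1}.$$

This implies that, for $k \ge 1$,

$$\begin{aligned}
&| p_\theta(Y_k \mid \overline{\mathbf{Y}}_0^{k-1}, X_0 = x_0) - \bar{p}_\theta(Y_k \mid \overline{\mathbf{Y}}_0^{k-1}) | \\
&\quad = \Bigl| \iint g_\theta(Y_k \mid \overline{\mathbf{Y}}_{k-1}, x_k)\, q_\theta(x_{k-1}, x_k)\, \mu(dx_k) \\
&\qquad\quad \times \bigl( P_\theta(dx_{k-1} \mid \overline{\mathbf{Y}}_0^{k-1}, X_0 = x_0) - \bar{P}_\theta(dx_{k-1} \mid \overline{\mathbf{Y}}_0^{k-1}) \bigr) \Bigr| \\
&\quad \le \rho^{k-1} \sup_{x_{k-1}} \int g_\theta(Y_k \mid \overline{\mathbf{Y}}_{k-1}, x_k)\, q_\theta(x_{k-1}, x_k)\, \mu(dx_k) \\
&\quad \le \rho^{k-1} \sigma_+ \int g_\theta(Y_k \mid \overline{\mathbf{Y}}_{k-1}, x)\, \mu(dx).
\end{aligned}$$

In addition, by (7),

$$p_\theta(Y_k \mid \overline{\mathbf{Y}}_0^{k-1}, X_0 = x_0) \ge \sigma_- \int g_\theta(Y_k \mid \overline{\mathbf{Y}}_{k-1}, x)\, \mu(dx),$$

and the same inequality holds for
$\bar{p}_\theta(Y_k \mid \overline{\mathbf{Y}}_0^{k-1})$. The inequality
$|\log x - \log y| \le |x - y| / (x \wedge y)$ now shows that

$$| \log p_\theta(Y_k \mid \overline{\mathbf{Y}}_0^{k-1}, X_0 = x_0) - \log \bar{p}_\theta(Y_k \mid \overline{\mathbf{Y}}_0^{k-1}) | \le \frac{\rho^{k-1}}{1 - \rho},$$

since $\sigma_+/\sigma_- = 1/(1 - \rho)$. A summation concludes the proof. □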
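
The stationary log likelihood is usually not computable, but the same argument applied to two different initial states $x_0$ and $x_0'$ gives $|l_n(\theta, x_0) - l_n(\theta, x_0')| \le 1/(1 - \rho)^2$ for every $n$, and this can be checked numerically in a finite-state case. The sketch below is an illustration under stated assumptions, not the paper's procedure: a two-state switching AR(1) with standard normal innovations, and hypothetical names `cond_loglik` and `simulate_emissions`.

```python
import numpy as np

def cond_loglik(Q, emis, x0):
    """l_n(theta, x0) = sum_k log p(Y_k | past observations, X_0 = x0)."""
    alpha = np.zeros(Q.shape[0])
    alpha[x0] = 1.0                    # law of X_0 is a point mass at x0
    total = 0.0
    for g in emis:
        alpha = (alpha @ Q) * g        # joint weight of (X_k, Y_k) given the past
        c = alpha.sum()                # one-step predictive density of Y_k
        total += np.log(c)
        alpha = alpha / c              # filtering law of X_k
    return total

def simulate_emissions(Q, a, n, rng):
    """Simulate a switching AR(1) path; return g(y_k | y_{k-1}, i) for all k, i."""
    x, y = 0, np.zeros(n + 1)
    for k in range(1, n + 1):
        x = rng.choice(len(a), p=Q[x])
        y[k] = a[x] * y[k - 1] + rng.standard_normal()
    resid = y[1:, None] - a[None, :] * y[:-1, None]
    return np.exp(-0.5 * resid ** 2) / np.sqrt(2.0 * np.pi)

Q = np.array([[0.8, 0.2],
              [0.3, 0.7]])
a = np.array([0.5, -0.5])              # hypothetical regime-specific slopes
rho = 1.0 - Q.min() / Q.max()          # rho = 1 - sigma_- / sigma_+
bound = 1.0 / (1.0 - rho) ** 2

rng = np.random.default_rng(1)
gaps = {}
for n in (10, 100, 1000):
    emis = simulate_emissions(Q, a, n, rng)
    gaps[n] = abs(cond_loglik(Q, emis, 0) - cond_loglik(Q, emis, 1))
    assert gaps[n] <= bound            # the gap does not grow with n
```

Because the gap stays bounded while the log likelihood itself grows linearly in $n$, the choice of initial state is washed out after division by $n$.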

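The next step consists in showing that $n^{-1} l_n(\theta)$ can be approximated
by the sample mean of a $\bar{P}_{\theta^*}$-stationary ergodic sequence of
random variables in $L^1(\bar{P}_{\theta^*})$. It is natural to approximate
$n^{-1} l_n(\theta) = n^{-1} \sum_{k=1}^{n} \log \bar{p}_\theta(Y_k \mid \overline{\mathbf{Y}}_0^{k-1})$
by $n^{-1} \sum_{k=1}^{n} \log \bar{p}_\theta(Y_k \mid \overline{\mathbf{Y}}_{-\infty}^{k-1})$,
provided we can give meaning to the latter conditional densities. This is the
main purpose of the construction below. Let, for $x \in \mathcal{X}$,

$$\Delta_{k,m,x}(\theta) = \log p_\theta(Y_k \mid \overline{\mathbf{Y}}_{-m}^{k-1}, X_{-m} = x),$$

$$\Delta_{k,m}(\theta) = \log \bar{p}_\theta(Y_k \mid \overline{\mathbf{Y}}_{-m}^{k-1}) = \log \int p_\theta(Y_k \mid \overline{\mathbf{Y}}_{-m}^{k-1}, X_{-m} = x_{-m})\, \bar{P}_\theta(dx_{-m} \mid \overline{\mathbf{Y}}_{-m}^{k-1}).$$

It follows from the definitions that
$l_n(\theta) = \sum_{k=1}^{n} \Delta_{k,0}(\theta)$. In order to show
that for any $k \ge 0$ the sequences $\{\Delta_{k,m}(\theta)\}_{m \ge 0}$ and
$\{\Delta_{k,m,x}(\theta)\}_{m \ge 0}$ converge uniformly w.r.t.
$\theta \in \Theta$, $\bar{P}_{\theta^*}$-a.s., we prove that they are uniform
Cauchy sequences. This property is implied by the following lemma.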

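Lemma 3. Assume (A1)-(A3). Then for all $k \ge 1$ and $m, m' \ge 0$,
$\bar{P}_{\theta^*}$-a.s.,

$$\sup_{\theta \in \Theta} \sup_{x, x' \in \mathcal{X}} |\Delta_{k,m,x}(\theta) - \Delta_{k,m',x'}(\theta)| \le \rho^{k + (m \wedge m') - 1} / (1 - \rho), \tag{12}$$

$$\sup_{\theta \in \Theta} \sup_{x \in \mathcal{X}} |\Delta_{k,m,x}(\theta) - \Delta_{k,m}(\theta)| \le \rho^{k + m - 1} / (1 - \rho), \tag{13}$$

$$\sup_{\theta \in \Theta} \sup_{m \ge 0} \sup_{x \in \mathcal{X}} |\Delta_{k,m,x}(\theta)| \le \max\bigl(|\log b_+|, |\log(\sigma_- b_-(\overline{\mathbf{Y}}_{k-1}, Y_k))|\bigr). \tag{14}$$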

The proof is similar to the proof of Lemma 2 (making use of the uniform
ergodicity of the conditional chain) and is given in the Appendix. By (12),
$\{\Delta_{k,m,x}(\theta)\}_{m \ge 0}$ is a uniform Cauchy sequence w.r.t.
$\theta \in \Theta$ $\bar{P}_{\theta^*}$-a.s. and thus
$\Delta_{k,m,x}(\theta)$ converges uniformly $\bar{P}_{\theta^*}$-a.s.
Equation (12) also implies that $\lim_{m \to \infty} \Delta_{k,m,x}(\theta)$
does not depend on $x$. Denote by $\Delta_{k,\infty}(\theta)$ this limit.
Intuitively, we may think of $\Delta_{k,\infty}(\theta)$ as
$\log \bar{p}_\theta(Y_k \mid \overline{\mathbf{Y}}_{-\infty}^{k-1})$.
Equation (14) shows that $\{\Delta_{k,m,x}(\theta)\}_{m \ge 0}$ is uniformly
bounded in $L^1(\bar{P}_{\theta^*})$, and thus the limit
$\Delta_{k,\infty}(\theta)$ is also in $L^1(\bar{P}_{\theta^*})$. Note that
$\{\Delta_{k,\infty}(\theta)\}$ is a $\bar{P}_{\theta^*}$-stationary ergodic
process.

Setting $m = 0$ in (12) and letting $m' \to \infty$ shows that,
$\bar{P}_{\theta^*}$-a.s.,

$$\sup_{\theta \in \Theta} |\Delta_{k,0,x}(\theta) - \Delta_{k,\infty}(\theta)| \le \rho^{k-1} / (1 - \rho),$$

and (13) shows that
$\sup_{\theta \in \Theta} |\Delta_{k,0,x}(\theta) - \Delta_{k,0}(\theta)| \le \rho^{k-1} / (1 - \rho)$.
These two relations readily imply the following result.

Corollary 2. Assume (A1) and (A2). Then

$$\sum_{k=1}^{n} \sup_{\theta \in \Theta} |\Delta_{k,0}(\theta) - \Delta_{k,\infty}(\theta)| \le 2/(1 - \rho)^2, \qquad \bar{P}_{\theta^*}\text{-a.s.}$$

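Corollary 2 shows that $n^{-1} l_n(\theta)$ can be approximated by the sample
mean of a stationary ergodic sequence, uniformly w.r.t. $\theta \in \Theta$.
Since $\Delta_{0,\infty}(\theta) \in L^1(\bar{P}_{\theta^*})$, the ergodic
theorem implies that
$n^{-1} l_n(\theta) \to l(\theta) := \bar{E}_{\theta^*}[\Delta_{0,\infty}(\theta)]$
$\bar{P}_{\theta^*}$-a.s. and in $L^1(\bar{P}_{\theta^*})$. Combining this
result with Lemma 2 yields the following.


Proposition 1. Assume (A1)-(A3). Then for all $x_0 \in \mathcal{X}$ and
$\theta \in \Theta$,

$$\lim_{n \to \infty} n^{-1} l_n(\theta, x_0) = l(\theta), \qquad \bar{P}_{\theta^*}\text{-a.s. and in } L^1(\bar{P}_{\theta^*}).$$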

Remark 4. The pointwise convergence of $n^{-1} l_n(\theta, x_0)$ has been
established for HMMs by Leroux (1992) and Le Gland and Mevel (2000) for a
finite state space and later for a compact state space by Douc and Matias
(2001). In the papers of Le Gland and Mevel (2000) and Douc and Matias
(2001), the authors used the geometric ergodicity of an extended Markov
chain consisting of the hidden variable, the observed variable and the
prediction filter density function. However, this approach requires conditions
stronger than the weak ergodicity condition (A2) and the moment condition
(A3).


The next step of the proof consists in showing that $l(\theta)$ is continuous w.r.t. $\theta$. To that purpose, first observe that, by (14) and the dominated convergence theorem, for any $x\in\mathcal X$ and $\theta\in\Theta$,

$l(\theta)=E_{\theta^*}\Bigl[\lim_{m\to\infty}\Delta_{0,m,x}(\theta)\Bigr]=\lim_{m\to\infty}E_{\theta^*}[\Delta_{0,m,x}(\theta)]$.

Since $\{\Delta_{0,m,x}(\theta)\}_{m\ge0}$ is a uniform Cauchy sequence $P_{\theta^*}$-a.s. which is uniformly bounded in $L^1(P_{\theta^*})$ ($E_{\theta^*}[\sup_{m\ge0}\sup_{\theta\in\Theta}|\Delta_{0,m,x}(\theta)|]<\infty$), it suffices to show that $\Delta_{0,m,x}(\theta)$ is continuous w.r.t. $\theta$. In fact, this is the whole point of using $\Delta_{0,m,x}(\theta)$ instead of $\Delta_{0,m}(\theta)$. We will need the following additional assumption:

(A4) For all $x,x'\in\mathcal X$ and all $(\overline y,y')\in\mathcal Y^s\times\mathcal Y$, $\theta\mapsto q_\theta(x,x')$ and $\theta\mapsto g_\theta(y'|\overline y,x)$ are continuous.


Lemma 4. Assume (A1)-(A4). Then for all $\theta\in\Theta$,

$\lim_{\delta\to0}E_{\theta^*}\Bigl[\sup_{|\theta'-\theta|\le\delta}|\Delta_{0,\infty}(\theta')-\Delta_{0,\infty}(\theta)|\Bigr]=0$.


The proof is given in the Appendix. We may now state the central result of this section, the uniform convergence of the normalized log likelihood $n^{-1}l_n(\theta,x_0)$ to $l(\theta)$, which follows almost immediately from Corollary 2 and the ergodic theorem.


Proposition 2. Assume (A1)-(A4). Then

$\lim_{n\to\infty}\sup_{\theta\in\Theta}\sup_{x_0\in\mathcal X}|n^{-1}l_n(\theta,x_0)-l(\theta)|=0$, $P_{\theta^*}$-a.s.


Again, the proof is in the Appendix. 


5. Consistency of the maximum likelihood estimator. We will now prove that under suitable assumptions the unique maximizer of $\theta\mapsto l(\theta)$ is $\theta^*$, the true value of the parameter. Let $P^Y_\theta$ be the trace of $P_\theta$ on $\{\mathcal Y^{\mathbb N},\mathcal B(\mathcal Y)^{\otimes\mathbb N}\}$, that is, the distribution of $\{Y_k\}$. Consider the following assumption:

(A5) $\theta=\theta^*$ if and only if

(15) $P^Y_\theta=P^Y_{\theta^*}$.


In other words, under this assumption the stationary laws of the observed process associated with two different values of the parameter do not coincide unless the parameters do. This is obviously the minimal assumption that we can impose. When it comes to applying the results, it is sometimes more convenient to consider the following alternative identifiability condition, which in certain circumstances proves easier to verify.

(A5') $\theta=\theta^*$ if and only if

(16) $E_{\theta^*}\left[\log\frac{p_{\theta^*}(Y_1^p|\overline Y_0)}{p_\theta(Y_1^p|\overline Y_0)}\right]=0$ for all $p\ge1$.









In fact we will see below that under (A1)-(A3), these two conditions are equivalent. Of course neither of the identifiability assumptions stated above is entirely satisfactory, because both conditions implicitly make use of the stationary distribution of the complete chain, which typically is infeasible to compute. Nevertheless, it does not seem sensible to expect much simpler identifiability conditions based, say, on $g_\theta$ and $q_\theta$ alone. The usefulness of (A5') is revealed when conditioning on $\overline Y_0$, yielding $\theta=\theta^*$ if and only if

(17) $E_{\theta^*}\left[E_{\theta^*}\left[\log\frac{p_{\theta^*}(Y_1^p|\overline Y_0)}{p_\theta(Y_1^p|\overline Y_0)}\,\middle|\,\overline Y_0\right]\right]=0$ for all $p\ge1$.


In this expression the inner expectation is a conditional Kullback-Leibler measure, and hence nonnegative. If equality holds in (17), the inner conditional expectation vanishes $P_{\theta^*}$-a.s. This observation may in turn often be used to prove that $\theta=\theta^*$, using, for example, identifiability of mixtures of the family to which the densities $g_\theta(\cdot|\overline y,x)$ belong. A particular example involving linear regressions with normal disturbances and finite-valued regime is discussed in Krishnamurthy and Ryden [(1998), page 302]. Slightly different identifiability conditions are employed in Francq and Roussignol (1998).

Before proceeding to the equivalence of (A5) and (A5'), some preparatory lemmas are needed. We will first show that the conditional density function $p_\theta(Y_k^l|\overline Y_i^j)$ ($i\le j<k\le l$) converges to the unconditional density function $p_\theta(Y_k^l)$ when the gap $k-j$ tends to infinity. This can be viewed as a kind of $\beta$-mixing condition expressed directly on the conditional and the unconditional density functions, which is inherited from the ergodicity of the complete chain.


Lemma 5. Assume (A1)-(A3) and fix $k\le l$. Then

$\lim_{j\to-\infty}\sup_{i\le j}|p_\theta(Y_k^l|\overline Y_i^j)-p_\theta(Y_k^l)|=0$ in $P_{\theta^*}$-probability.

The proof is given in the Appendix. 

The following lemma shows that (15) and (16) are equivalent. 


Lemma 6. Assume (A1)-(A3). Then (15) holds if and only if (16) holds. 


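Proof. It obviously suffices to show the "if" part, so suppose (16) holds. The basic idea consists in inserting a gap in the range of variables. For $p\ge1$ and $m\ge0$, write

$0=E_{\theta^*}\left[\log\frac{p_{\theta^*}(Y_1^{p+m}|\overline Y_0)}{p_\theta(Y_1^{p+m}|\overline Y_0)}\right]$

$\ =E_{\theta^*}\left[\log\frac{p_{\theta^*}(Y_1^{m}|Y_{m+1}^{m+p},\overline Y_0)}{p_\theta(Y_1^{m}|Y_{m+1}^{m+p},\overline Y_0)}\right]+E_{\theta^*}\left[\log\frac{p_{\theta^*}(Y_{m+1}^{m+p}|\overline Y_0)}{p_\theta(Y_{m+1}^{m+p}|\overline Y_0)}\right]$.

The two terms on the right-hand side are expectations of Kullback-Leibler divergence functions and thus nonnegative, which shows that

$0\ge E_{\theta^*}\left[\log\frac{p_{\theta^*}(Y_{m+1}^{m+p}|\overline Y_0)}{p_\theta(Y_{m+1}^{m+p}|\overline Y_0)}\right]=E_{\theta^*}\left[\log\frac{p_{\theta^*}(Y_1^{p}|\overline Y_{-m})}{p_\theta(Y_1^{p}|\overline Y_{-m})}\right]$

$\ =E_{\theta^*}\left[\int\log\frac{p_{\theta^*}(Y_1^p=y_1^p|\overline Y_{-m})}{p_\theta(Y_1^p=y_1^p|\overline Y_{-m})}\,p_{\theta^*}(Y_1^p=y_1^p|\overline Y_{-m})\,\nu^{\otimes p}(dy_1^p)\right]$.

Thus, for all $m\ge0$,

$p_{\theta^*}(Y_1^p|\overline Y_{-m})=p_\theta(Y_1^p|\overline Y_{-m})$, $P_{\theta^*}$-a.s.

By Lemma 5,

$|p_{\theta^*}(Y_1^p)-p_\theta(Y_1^p)|=\lim_{m\to\infty}|p_{\theta^*}(Y_1^p|\overline Y_{-m})-p_\theta(Y_1^p|\overline Y_{-m})|=0$ in $P_{\theta^*}$-probability,

whence $p_{\theta^*}(Y_1^p)=p_\theta(Y_1^p)$ $P_{\theta^*}$-a.s. The proof is complete. □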

Proposition 3. Under (A1)-(A5), $l(\theta)\le l(\theta^*)$ and $l(\theta)=l(\theta^*)$ if and only if $\theta=\theta^*$.


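Proof. By the dominated convergence theorem,

(18) $l(\theta)=E_{\theta^*}\Bigl[\lim_{m\to\infty}\log p_\theta(Y_1|\overline Y_{-m}^0)\Bigr]=\lim_{m\to\infty}E_{\theta^*}[\log p_\theta(Y_1|\overline Y_{-m}^0)]=\lim_{m\to\infty}E_{\theta^*}\bigl[E_{\theta^*}[\log p_\theta(Y_1|\overline Y_{-m}^0)\,|\,\overline Y_{-m}^0]\bigr]$.

Hence $l(\theta^*)-l(\theta)$ is nonnegative as the limit of expectations of conditional Kullback-Leibler divergence functions and $\theta^*$ is a maximizer of the function $\theta\mapsto l(\theta)$.

Now assume $l(\theta)=l(\theta^*)$. By Lemma 6 it suffices to prove that (16) holds. Note that for any $k\ge1$ and $m\ge0$,

$E_{\theta^*}[\log p_\theta(Y_1^k|\overline Y_{-m}^0)]=\sum_{i=1}^{k}E_{\theta^*}[\log p_\theta(Y_1|\overline Y_{-m-i+1}^0)]$.

Hence, by (18),

$\lim_{m\to\infty}E_{\theta^*}[\log p_\theta(Y_1^k|\overline Y_{-m}^0)]=k\,l(\theta)$,

and for $p+s<k+1$,

$0=k(l(\theta^*)-l(\theta))=\lim_{m\to\infty}E_{\theta^*}\left[\log\frac{p_{\theta^*}(Y_1^k|\overline Y_{-m}^0)}{p_\theta(Y_1^k|\overline Y_{-m}^0)}\right]$

$\ \ge\limsup_{m\to\infty}E_{\theta^*}\left[\log\frac{p_{\theta^*}(Y_{k-p+1}^k|\overline Y_{k-p},\overline Y_{-m}^0)}{p_\theta(Y_{k-p+1}^k|\overline Y_{k-p},\overline Y_{-m}^0)}\right]$

$\ =\limsup_{m\to\infty}E_{\theta^*}\left[\log\frac{p_{\theta^*}(Y_1^p|\overline Y_0,\overline Y_{p-k-m}^{p-k})}{p_\theta(Y_1^p|\overline Y_0,\overline Y_{p-k-m}^{p-k})}\right]$.

The proof is concluded by letting $k\to\infty$ and using Lemma 7. □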


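Lemma 7. Assume (A1)-(A3). Then for all $p\ge1$ and all $\theta\in\Theta$,

$\lim_{k\to\infty}\sup_{m\ge k}\left|E_{\theta^*}\left[\log\frac{p_{\theta^*}(Y_1^p|\overline Y_0,\overline Y_{-m}^{-k})}{p_\theta(Y_1^p|\overline Y_0,\overline Y_{-m}^{-k})}\right]-E_{\theta^*}\left[\log\frac{p_{\theta^*}(Y_1^p|\overline Y_0)}{p_\theta(Y_1^p|\overline Y_0)}\right]\right|=0$.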


The proof of this lemma is based on the mixing properties of the complete 
chain (see Lemma 5) and is postponed to the Appendix. 

We may now summarize our findings in the following theorem, which 
states the strong consistency of the conditional MLE. 

Theorem 1. Assume (A1)-(A5). Then, for any $x_0\in\mathcal X$, $\lim_{n\to\infty}\hat\theta_{n,x_0}=\theta^*$, $P_{\theta^*}$-a.s.


6. Asymptotic normality of the maximum likelihood estimator. Lemma 1 and Corollary 1 are the basic tools for generalizing the results of BRR and JP. The pattern of the proof of asymptotic normality of the MLE is similar to that presented in these contributions, with two major differences. First, the geometric upper bounds are deterministic, which is a consequence of Lemma 1 and Corollary 1. Second, in this paper, the MLE is the maximizer of the conditional log likelihood $l_n(\theta,x_0)$, where $x_0$ is some fixed arbitrary point in $\mathcal X$, whereas in BRR and JP it is the maximizer of the unconditional log likelihood $l_n(\theta)$.

Not surprisingly, the proof of asymptotic normality requires some additional regularity assumptions. Let $\nabla_\theta$ and $\nabla^2_\theta$ be the gradient and the Hessian operator with respect to the parameter $\theta$, respectively. We will assume that there exists a positive real $\delta$ such that on $G=\{\theta\in\Theta:|\theta-\theta^*|<\delta\}$ the following conditions hold:

(A6) For all $x,x'\in\mathcal X$ and $(\overline y,y')\in\mathcal Y^s\times\mathcal Y$, the functions $\theta\mapsto q_\theta(x,x')$ and $\theta\mapsto g_\theta(y'|\overline y,x')$ are twice continuously differentiable on $G$.

(A7) (a) $\sup_{\theta\in G}\sup_{x,x'}\|\nabla_\theta\log q_\theta(x,x')\|<\infty$ and $\sup_{\theta\in G}\sup_{x,x'}\|\nabla^2_\theta\log q_\theta(x,x')\|<\infty$.

(b) $E_{\theta^*}[\sup_{\theta\in G}\sup_{x}\|\nabla_\theta\log g_\theta(Y_1|\overline Y_0,x)\|^2]<\infty$ and $E_{\theta^*}[\sup_{\theta\in G}\sup_{x}\|\nabla^2_\theta\log g_\theta(Y_1|\overline Y_0,x)\|]<\infty$.

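(A8) (a) For $\nu^{\otimes(s+1)}$-almost all $(\overline y,y')$ in $\mathcal Y^s\times\mathcal Y$ there exists a function $f_{\overline y,y'}:\mathcal X\to\mathbb R^+$ in $L^1(\mu)$ such that $\sup_{\theta\in G}g_\theta(y'|\overline y,x)\le f_{\overline y,y'}(x)$.

(b) For $\mu\otimes\nu^{\otimes s}$-almost all $(x,\overline y)\in\mathcal X\times\mathcal Y^s$, there exist functions $f^1_{x,\overline y}:\mathcal Y\to\mathbb R^+$ and $f^2_{x,\overline y}:\mathcal Y\to\mathbb R^+$ in $L^1(\nu)$ such that $\|\nabla_\theta g_\theta(y'|\overline y,x)\|\le f^1_{x,\overline y}(y')$ and $\|\nabla^2_\theta g_\theta(y'|\overline y,x)\|\le f^2_{x,\overline y}(y')$ for all $\theta\in G$.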

Remark 5. The regularity requirements (existence of first and second derivatives at all points, existence of integrable upper bounds) are reminiscent of Cramér's classical proof of asymptotic normality of the MLE. It is obvious that these conditions could have been weakened using more sophisticated techniques. We will nevertheless stick to the conventional proof.

Remark 6. The conditions are weaker and more easily checked than those used by JP, who assumed that the stationary density of the complete Markov chain is twice differentiable w.r.t. $\theta$, a condition which is difficult to check except for very simple models. However, as seen below, by using proper conditioning techniques it is possible to avoid such assumptions.

Asymptotic normality of the MLE is implied by:

(i) a central limit theorem (CLT) for the Fisher score function $\nabla_\theta l_n(\theta^*,x_0)$, and

(ii) a locally uniform law of large numbers for the observed Fisher information $-n^{-1}\nabla^2_\theta l_n(\theta,x_0)$ for $\theta$ in a neighborhood of $\theta^*$.

Along the lines of the proofs by BRR and JP, the key to the proof consists in finding proper expressions for these two quantities. Exploiting the hierarchical structure of the model, it turns out that it is practical to express the score function and the observed Fisher information as functions of conditional expectations of the complete score function and the complete Fisher information.


6.1. A central limit theorem for the score function. The Fisher identity [Louis (1982)] generally states that for a model with missing data, the score function equals the conditional expectation of the complete score given the observed data; the complete score is the gradient of the complete log likelihood, that is, the likelihood that includes the missing data in addition to the observed data. The rationale for using this identity is that the log likelihood and score functions themselves are typically rather involved [cf. (1)] while the complete log likelihood and score are simpler. This is true in our case, in which the Markov chain $\{X_k\}_{k=1}^{\infty}$ constitutes the missing data. The Fisher identity requires exchanging the gradient operator with certain integrals, and is valid under (A7) and (A8). Hence, for any $x_0\in\mathcal X$,

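$n^{-1/2}\nabla_\theta l_n(\theta^*,x_0)=n^{-1/2}\sum_{k=1}^{n}\nabla_\theta\log p_{\theta^*}(Y_k|\overline Y_0^{k-1},X_0=x_0)=n^{-1/2}\sum_{k=1}^{n}\Delta_{k,0,x_0}(\theta^*)$,

where for any $x\in\mathcal X$ and $\theta\in\Theta$,

$\Delta_{k,0,x}(\theta)\triangleq E_\theta\left[\sum_{i=1}^{k}\phi(\theta,Z_{i-1}^{i})\,\middle|\,\overline Y_0^{k},X_0=x\right]-E_\theta\left[\sum_{i=1}^{k-1}\phi(\theta,Z_{i-1}^{i})\,\middle|\,\overline Y_0^{k-1},X_0=x\right]$

with the convention $\sum_{i=a}^{b}c_i=0$ if $b<a$ and

$\phi(\theta,Z_{i-1}^{i})=\phi(\theta,Z_{i-1},Z_i)=\phi(\theta,(X_{i-1},\overline Y_{i-1}),(X_i,\overline Y_i))\triangleq\nabla_\theta\log\bigl(q_\theta(X_{i-1},X_i)\,g_\theta(Y_i|\overline Y_{i-1},X_i)\bigr)$

is the conditional score of $(X_i,Y_i)$ given $(X_{i-1},\overline Y_{i-1})$.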
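We also let, for $m\ge0$,

$\Delta_{k,m}(\theta)\triangleq E_\theta\left[\sum_{i=-m+1}^{k}\phi(\theta,Z_{i-1}^{i})\,\middle|\,\overline Y_{-m}^{k}\right]-E_\theta\left[\sum_{i=-m+1}^{k-1}\phi(\theta,Z_{i-1}^{i})\,\middle|\,\overline Y_{-m}^{k-1}\right]$.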


Similar to what is done in BRR and JP we show that $\Delta_{k,0,x}(\theta^*)$ can be approximated in $L^2(P_{\theta^*})$ by a $P_{\theta^*}$-stationary martingale increment sequence and apply a CLT for sums of stationary martingale increments.

The first step in the proof consists of showing that the initial point $x$ does not show up in the limit.


Lemma 8. Assume (A1), (A2) and (A6)-(A7). Then, for all $x\in\mathcal X$,

$\lim_{n\to\infty}E_{\theta^*}\left\|n^{-1/2}\sum_{k=1}^{n}\bigl(\Delta_{k,0,x}(\theta^*)-\Delta_{k,0}(\theta^*)\bigr)\right\|^2=0$.




















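Proof. Write

$\sum_{k=1}^{n}\bigl(\Delta_{k,0,x}(\theta^*)-\Delta_{k,0}(\theta^*)\bigr)=\sum_{k=1}^{n}\bigl(E_{\theta^*}[\phi(\theta^*,Z_{k-1}^{k})|\overline Y_0^{n},X_0=x]-E_{\theta^*}[\phi(\theta^*,Z_{k-1}^{k})|\overline Y_0^{n}]\bigr)$.

Under the stated assumptions $E_{\theta^*}(\sup_{x,x'\in\mathcal X}\|\phi(\theta^*,(x,\overline Y_0),(x',\overline Y_1))\|^2)<\infty$. The proof now follows from Corollary 1, which implies that

$\|E_{\theta^*}[\phi(\theta^*,Z_{k-1}^{k})|\overline Y_0^{n},X_0=x]-E_{\theta^*}[\phi(\theta^*,Z_{k-1}^{k})|\overline Y_0^{n}]\|\le2\sup_{x,x'\in\mathcal X}\|\phi(\theta^*,(x,\overline Y_{k-1}),(x',\overline Y_k))\|\,\rho^{k-1}$.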


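We will now show that for any $k$, $\{\Delta_{k,m}(\theta^*)\}_{m\ge0}$ is a Cauchy sequence in $L^2(P_{\theta^*})$. Since

$\Delta_{k,m}(\theta^*)=E_{\theta^*}[\phi(\theta^*,Z_{k-1}^{k})|\overline Y_{-m}^{k}]+\sum_{i=-m+1}^{k-1}\bigl(E_{\theta^*}[\phi(\theta^*,Z_{i-1}^{i})|\overline Y_{-m}^{k}]-E_{\theta^*}[\phi(\theta^*,Z_{i-1}^{i})|\overline Y_{-m}^{k-1}]\bigr)$,

the difference $\Delta_{k,m}(\theta^*)-\Delta_{k,m'}(\theta^*)$ (assuming $m'\ge m\ge0$) involves for each $-m<i\le k$ terms of the form either $E_{\theta^*}[\phi(\theta^*,Z_{i-1}^{i})|\overline Y_{-m}^{k}]-E_{\theta^*}[\phi(\theta^*,Z_{i-1}^{i})|\overline Y_{-m'}^{k}]$ or $E_{\theta^*}[\phi(\theta^*,Z_{i-1}^{i})|\overline Y_{-m}^{k-1}]-E_{\theta^*}[\phi(\theta^*,Z_{i-1}^{i})|\overline Y_{-m'}^{k-1}]$. By Corollary 1 and an argument used to prove Lemma 3 we obtain for $-m'\le-m<i\le k$ that, $P_{\theta^*}$-a.s.,

(19) $\|E_{\theta^*}[\phi(\theta^*,Z_{i-1}^{i})|\overline Y_{-m}^{k}]-E_{\theta^*}[\phi(\theta^*,Z_{i-1}^{i})|\overline Y_{-m'}^{k}]\|\le2\sup_{x,x'\in\mathcal X}\|\phi(\theta^*,(x,\overline Y_{i-1}),(x',\overline Y_i))\|\,\rho^{i+m-1}$.

Note that this term is small when $i$ is far from $-m$, say, $i\ge(k-m)/2$. Another kind of inequality is required to bound $\|E_{\theta^*}[\phi(\theta^*,Z_{i-1}^{i})|\overline Y_{-m}^{k}]-E_{\theta^*}[\phi(\theta^*,Z_{i-1}^{i})|\overline Y_{-m}^{k-1}]\|$. This type of bound will follow from forgetting properties of the reverse conditional hidden chain. Similar to Lemma 1, we have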
the following result. 


Lemma 9. Assume (A1) and (A2). Let $m,n\in\mathbb Z$ with $m,n\ge0$ and $\theta\in\Theta$. Under $P_\theta$, conditionally on $\overline Y_{-m}^{n}$, the time-reversed process $\{X_{n-k}\}_{0\le k\le n+m}$ is an inhomogeneous Markov chain, and for all $0<k\le n+m$ there exists a function $\tilde\mu_k(y_{-m-s+1}^{n-k},A)$ such that:

(i) for any $A\in\mathcal B(\mathcal X)$, $y_{-m-s+1}^{n-k}\mapsto\tilde\mu_k(y_{-m-s+1}^{n-k},A)$ is a Borel function;

(ii) for any $y_{-m-s+1}^{n-k}$, $\tilde\mu_k(y_{-m-s+1}^{n-k},\cdot)$ is a probability measure on $\mathcal B(\mathcal X)$.

In addition, for all $y_{-m-s+1}^{n-k}$, $\tilde\mu_k(y_{-m-s+1}^{n-k},\cdot)\ll\mu$ and for all $\overline Y_{-m}^{n}$,

$P_\theta(X_{n-k}\in A|X_{n-k+1},\overline Y_{-m}^{n})=P_\theta(X_{n-k}\in A|X_{n-k+1},\overline Y_{-m}^{n-k})\ge\frac{\sigma_-}{\sigma_+}\tilde\mu_k(Y_{-m-s+1}^{n-k},A)$.

The proof is along the same lines as Lemma 1 and is omitted for brevity.

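From this lemma, using an analogue of Corollary 1, it follows that for $-m<i<k$,

(20) $\|E_{\theta^*}[\phi(\theta^*,Z_{i-1}^{i})|\overline Y_{-m}^{k}]-E_{\theta^*}[\phi(\theta^*,Z_{i-1}^{i})|\overline Y_{-m}^{k-1}]\|\le2\sup_{x,x'\in\mathcal X}\|\phi(\theta^*,(x,\overline Y_{i-1}),(x',\overline Y_i))\|\,\rho^{k-i-1}$.

By a standard martingale theory result [see, e.g., Shiryaev (1996), page 510], under assumption (A7) $E_{\theta^*}[\phi(\theta^*,Z_{i-1}^{i})|\overline Y_{-m}^{k}]\to E_{\theta^*}[\phi(\theta^*,Z_{i-1}^{i})|\overline Y_{-\infty}^{k}]$, $P_{\theta^*}$-a.s. as $m\to\infty$. Hence inequalities (19) and (20) hold true, $P_{\theta^*}$-a.s., when either $m$ or $m'$ is replaced by $\infty$. Using (20) with $m=\infty$ shows that

$\sum_{i=-\infty}^{k-1}\|E_{\theta^*}[\phi(\theta^*,Z_{i-1}^{i})|\overline Y_{-\infty}^{k}]-E_{\theta^*}[\phi(\theta^*,Z_{i-1}^{i})|\overline Y_{-\infty}^{k-1}]\|\le\sum_{i=-\infty}^{k-1}2\sup_{x,x'\in\mathcal X}\|\phi(\theta^*,(x,\overline Y_{i-1}),(x',\overline Y_i))\|\,\rho^{k-i-1}$.

Under (A7) the right-hand side is in $L^2(P_{\theta^*})$, and we may thus define

$\Delta_{k,\infty}(\theta^*)\triangleq E_{\theta^*}[\phi(\theta^*,Z_{k-1}^{k})|\overline Y_{-\infty}^{k}]+\sum_{i=-\infty}^{k-1}\bigl(E_{\theta^*}[\phi(\theta^*,Z_{i-1}^{i})|\overline Y_{-\infty}^{k}]-E_{\theta^*}[\phi(\theta^*,Z_{i-1}^{i})|\overline Y_{-\infty}^{k-1}]\bigr)$.

In addition we have the following $L^2$-bound, showing that $\Delta_{k,m}(\theta^*)$ converges to $\Delta_{k,\infty}(\theta^*)$ in $L^2(P_{\theta^*})$ as $m\to\infty$.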


Lemma 10. Assume (A1), (A2) and (A6)-(A7). Then, for all $k\ge1$ and $m\ge0$,

$\bigl(E_{\theta^*}\|\Delta_{k,m}(\theta^*)-\Delta_{k,\infty}(\theta^*)\|^2\bigr)^{1/2}\le12\Bigl(E_{\theta^*}\Bigl[\sup_{x,x'\in\mathcal X}\|\phi(\theta^*,(x,\overline Y_0),(x',\overline Y_1))\|^2\Bigr]\Bigr)^{1/2}\frac{\rho^{(k+m)/2-1}}{1-\rho}$.

Proof. Using (19) and (20) and the Minkowski inequality, we find that apart from the factor $(E_{\theta^*}[\sup_{x,x'\in\mathcal X}\|\phi(\theta^*,(x,\overline Y_0),(x',\overline Y_1))\|^2])^{1/2}$, $(E_{\theta^*}\|\Delta_{k,m}(\theta^*)-\Delta_{k,\infty}(\theta^*)\|^2)^{1/2}$ is bounded by

$2\rho^{k+m-1}+4\sum_{i=-m+1}^{k-1}(\rho^{k-i-1}\wedge\rho^{i+m-1})+2\sum_{i=-\infty}^{-m}\rho^{k-i-1}$

$\ \le2\rho^{k+m-1}+4\sum_{-\infty<i\le(k-m)/2}\rho^{k-i-1}+4\sum_{(k-m)/2\le i<\infty}\rho^{i+m-1}+2\frac{\rho^{k+m-1}}{1-\rho}$

$\ \le12\frac{\rho^{(k+m)/2-1}}{1-\rho}$. □
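The final inequality in the proof of Lemma 10 is elementary but easy to mis-transcribe, so a numerical sanity check may be useful. The sketch below evaluates the three-term bound $2\rho^{k+m-1}+4\sum_{i=-m+1}^{k-1}(\rho^{k-i-1}\wedge\rho^{i+m-1})+2\rho^{k+m-1}/(1-\rho)$ and compares it with $12\rho^{(k+m)/2-1}/(1-\rho)$ on a small grid. The function name `lemma10_sides` and the grid of test values are illustrative choices, not part of the paper, and the check is of course no substitute for the proof.

```python
def lemma10_sides(rho, k, m):
    """Return (lhs, rhs) of the last inequality in the proof of Lemma 10.

    lhs is the three-term bound obtained from (19), (20) and Minkowski;
    rhs is 12 * rho**((k + m)/2 - 1) / (1 - rho).
    """
    # Sum over i = -m+1, ..., k-1 of min(rho^(k-i-1), rho^(i+m-1)).
    middle = sum(
        min(rho ** (k - i - 1), rho ** (i + m - 1))
        for i in range(-m + 1, k)
    )
    tail = rho ** (k + m - 1) / (1.0 - rho)  # sum over i <= -m of rho^(k-i-1)
    lhs = 2.0 * rho ** (k + m - 1) + 4.0 * middle + 2.0 * tail
    rhs = 12.0 * rho ** ((k + m) / 2.0 - 1.0) / (1.0 - rho)
    return lhs, rhs


# Largest ratio lhs / rhs over an (arbitrary) grid of rho, k >= 1, m >= 0.
worst_ratio = max(
    lhs / rhs
    for rho in (0.1, 0.5, 0.9, 0.99)
    for k in range(1, 30)
    for m in range(0, 30)
    for lhs, rhs in [lemma10_sides(rho, k, m)]
)
print("largest lhs/rhs on the grid:", worst_ratio)
```

A ratio below one on the whole grid is consistent with the constant 12 being sufficient, with some slack.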


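Now define the filtration $\mathcal F$ by $\mathcal F_k=\sigma(\overline Y_i;\,-\infty<i\le k)$ for $k\in\mathbb Z$. By the conditional dominated convergence theorem,

$E_{\theta^*}\left[\sum_{i=-\infty}^{k-1}\bigl(E_{\theta^*}[\phi(\theta^*,Z_{i-1}^{i})|\overline Y_{-\infty}^{k}]-E_{\theta^*}[\phi(\theta^*,Z_{i-1}^{i})|\overline Y_{-\infty}^{k-1}]\bigr)\,\middle|\,\overline Y_{-\infty}^{k-1}\right]=0$,

and

$E_{\theta^*}[\phi(\theta^*,Z_{k-1}^{k})|\overline Y_{-\infty}^{k-1}]=E_{\theta^*}\bigl[E_{\theta^*}[\phi(\theta^*,Z_{k-1}^{k})|\overline Y_{-\infty}^{k-1},X_{k-1}]\,\big|\,\overline Y_{-\infty}^{k-1}\bigr]=0$,

so that $\{\Delta_{k,\infty}(\theta^*)\}_{k=-\infty}^{\infty}$ is an $(\mathcal F,P_{\theta^*})$-adapted stationary, ergodic and square integrable martingale increment sequence. The CLT for sums of such sequences [see, e.g., Durrett (1996), page 418] shows that

$n^{-1/2}\sum_{k=1}^{n}\Delta_{k,\infty}(\theta^*)\to\mathcal N(0,I(\theta^*))$, $P_{\theta^*}$-weakly,

where $I(\theta^*)=E_{\theta^*}[\Delta_{0,\infty}(\theta^*)\Delta_{0,\infty}(\theta^*)^T]$ is the asymptotic Fisher information matrix, defined as the covariance matrix of the asymptotic score function. Lemma 10 implies that

(21) $\lim_{n\to\infty}E_{\theta^*}\left\|n^{-1/2}\sum_{k=1}^{n}\bigl(\Delta_{k,0}(\theta^*)-\Delta_{k,\infty}(\theta^*)\bigr)\right\|^2=0$,

and hence $n^{-1/2}\sum_{k=1}^{n}\Delta_{k,0}(\theta^*)$, and by Lemma 8 also $n^{-1/2}\sum_{k=1}^{n}\Delta_{k,0,x}(\theta^*)$, have the same limiting distribution under $P_{\theta^*}$. We summarize our findings in the following result.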


Theorem 2. Assume (A1), (A2) and (A6)-(A8). Then for any $x_0\in\mathcal X$,
$n^{-1/2}\nabla_\theta l_n(\theta^*,x_0)\to\mathcal N(0,I(\theta^*))$, $P_{\theta^*}$-weakly.













6.2. Law of large numbers for the observed Fisher information. The second part of the proof consists of showing a locally uniform law of large numbers for the observed Fisher information; for all possibly random sequences $\{\theta^*_n\}$ such that $\theta^*_n\to\theta^*$, $P_{\theta^*}$-a.s., $-n^{-1}\nabla^2_\theta l_n(\theta^*_n,x_0)$ converges, $P_{\theta^*}$-a.s., to the Fisher information matrix at $\theta^*$. Similar to what was done in the previous section and following the ideas developed in BRR, the proof amounts to showing that $-n^{-1}\nabla^2_\theta l_n(\theta^*_n,x_0)$ may be approximated by the sample mean of an ergodic stationary process. To do that it is convenient, just as for the score function, to express the observed Fisher information in terms of the Hessian of the complete log likelihood. This can be done by using the so-called Louis missing information principle [Louis (1982)], valid under (A7) and (A8), which shows that


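(22) $\nabla^2_\theta\log p_\theta(Y_1^{n}|\overline Y_0,X_0=x_0)=E_\theta\left[\sum_{i=1}^{n}\varphi(\theta,Z_{i-1}^{i})\,\middle|\,\overline Y_0^{n},X_0=x_0\right]+\mathrm{var}_\theta\left[\sum_{i=1}^{n}\phi(\theta,Z_{i-1}^{i})\,\middle|\,\overline Y_0^{n},X_0=x_0\right]$,

where

$\varphi(\theta,Z_{i-1}^{i})=\varphi(\theta,Z_{i-1},Z_i)=\varphi(\theta,(X_{i-1},\overline Y_{i-1}),(X_i,\overline Y_i))\triangleq\nabla^2_\theta\log\bigl(q_\theta(X_{i-1},X_i)\,g_\theta(Y_i|\overline Y_{i-1},X_i)\bigr)$.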


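As above we may write these quantities as telescoping sums:

$E_\theta\left[\sum_{i=1}^{n}\varphi(\theta,Z_{i-1}^{i})\,\middle|\,\overline Y_0^{n},X_0=x_0\right]=\sum_{k=1}^{n}\left(E_\theta\left[\sum_{i=1}^{k}\varphi(\theta,Z_{i-1}^{i})\,\middle|\,\overline Y_0^{k},X_0=x_0\right]-E_\theta\left[\sum_{i=1}^{k-1}\varphi(\theta,Z_{i-1}^{i})\,\middle|\,\overline Y_0^{k-1},X_0=x_0\right]\right)$

and

$\mathrm{var}_\theta\left[\sum_{i=1}^{n}\phi(\theta,Z_{i-1}^{i})\,\middle|\,\overline Y_0^{n},X_0=x_0\right]=\sum_{k=1}^{n}\left(\mathrm{var}_\theta\left[\sum_{i=1}^{k}\phi(\theta,Z_{i-1}^{i})\,\middle|\,\overline Y_0^{k},X_0=x_0\right]-\mathrm{var}_\theta\left[\sum_{i=1}^{k-1}\phi(\theta,Z_{i-1}^{i})\,\middle|\,\overline Y_0^{k-1},X_0=x_0\right]\right)$.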

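It turns out (see Lemma 13) that as $k\to\infty$ the initial condition on $X_0$ becomes irrelevant. Therefore it is sensible to define, for $k\ge1$ and $m\ge0$,

(23) $\Lambda_{k,m}(\theta)=E_\theta\left[\sum_{i=-m+1}^{k}\varphi(\theta,Z_{i-1}^{i})\,\middle|\,\overline Y_{-m}^{k}\right]-E_\theta\left[\sum_{i=-m+1}^{k-1}\varphi(\theta,Z_{i-1}^{i})\,\middle|\,\overline Y_{-m}^{k-1}\right]$,

(24) $\Gamma_{k,m}(\theta)=\mathrm{var}_\theta\left[\sum_{i=-m+1}^{k}\phi(\theta,Z_{i-1}^{i})\,\middle|\,\overline Y_{-m}^{k}\right]-\mathrm{var}_\theta\left[\sum_{i=-m+1}^{k-1}\phi(\theta,Z_{i-1}^{i})\,\middle|\,\overline Y_{-m}^{k-1}\right]$.

Propositions 4 and 5 show that $\Lambda_{k,m}(\theta)$ and $\Gamma_{k,m}(\theta)$ both have limits as $m\to\infty$, $P_{\theta^*}$-a.s., and in $L^1(P_{\theta^*})$. Let $\Lambda_{k,\infty}(\theta)$ and $\Gamma_{k,\infty}(\theta)$ denote these limits. It follows from the definitions above that $\{\Lambda_{k,\infty}\}_{k=1}^{\infty}$ and $\{\Gamma_{k,\infty}\}_{k=1}^{\infty}$ are $P_{\theta^*}$-stationary and ergodic, and the limit of the observed Fisher information will be $-E_{\theta^*}[\Lambda_{0,\infty}(\theta^*)+\Gamma_{0,\infty}(\theta^*)]$.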


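Proposition 4. Assume (A1)-(A3). Let $G$ be a compact subset of $\Theta$, let $q>0$ and let $\varphi:\Theta\times\mathcal X^q\times\mathcal Y^q\to\mathbb R$ be a Borel function such that for all $x_1^q\in\mathcal X^q$ and $y_1^q\in\mathcal Y^q$, $\varphi(\cdot,x_1^q,y_1^q)$ is continuous w.r.t. $\theta$ on $G$ and

$E_{\theta^*}\Bigl[\sup_{\theta\in G}\sup_{x_1^q\in\mathcal X^q}|\varphi(\theta,x_1^q,Y_1^q)|\Bigr]<\infty$.

Then for each $\theta\in G$, $\Lambda_{k,m}(\theta)$, as defined in (23), converges $P_{\theta^*}$-a.s. and in $L^1(P_{\theta^*})$ to $\Lambda_{k,\infty}(\theta)$ as $m\to\infty$. In addition, the function $\theta\mapsto E_{\theta^*}[\Lambda_{0,\infty}(\theta)]$ is continuous on $G$ and for all $x_0\in\mathcal X$ and $\theta\in G$,

$\lim_{\delta\to0}\lim_{n\to\infty}\sup_{|\theta'-\theta|\le\delta}\left|n^{-1}E_{\theta'}\left[\sum_{i=1}^{n}\varphi(\theta',Z_{i-q+1}^{i})\,\middle|\,\overline Y_0^{n},X_0=x_0\right]-E_{\theta^*}[\Lambda_{0,\infty}(\theta)]\right|=0$, $P_{\theta^*}$-a.s.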


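Proposition 5. Assume (A1)-(A3). Let $G$ be a compact subset of $\Theta$, let $q>0$ and let $\phi:\Theta\times\mathcal X^q\times\mathcal Y^q\to\mathbb R$ be a Borel function such that for all $x_1^q\in\mathcal X^q$ and $y_1^q\in\mathcal Y^q$, $\phi(\cdot,x_1^q,y_1^q)$ is continuous w.r.t. $\theta$ on $G$ and

$E_{\theta^*}\Bigl[\sup_{\theta\in G}\sup_{x_1^q\in\mathcal X^q}|\phi(\theta,x_1^q,Y_1^q)|^2\Bigr]<\infty$.

Then for each $\theta\in G$, $\Gamma_{k,m}(\theta)$, as defined in (24), converges $P_{\theta^*}$-a.s. and in $L^1(P_{\theta^*})$ to $\Gamma_{k,\infty}(\theta)$ as $m\to\infty$. In addition, the function $\theta\mapsto E_{\theta^*}[\Gamma_{0,\infty}(\theta)]$ is continuous on $G$ and, for all $x_0\in\mathcal X$ and $\theta\in G$,

$\lim_{\delta\to0}\lim_{n\to\infty}\sup_{|\theta'-\theta|\le\delta}\left|n^{-1}\mathrm{var}_{\theta'}\left[\sum_{i=1}^{n}\phi(\theta',Z_{i-q+1}^{i})\,\middle|\,\overline Y_0^{n},X_0=x_0\right]-E_{\theta^*}[\Gamma_{0,\infty}(\theta)]\right|=0$, $P_{\theta^*}$-a.s.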


Note that in Propositions 4 and 5 the functions $\varphi$ and $\phi$ take values in $\mathbb R$. Adaptations to vector- and matrix-valued functions are straightforward.

For all $x_0\in\mathcal X$ the Fisher information identity implies, under the stated assumptions, that

$n^{-1}E_\theta[\nabla_\theta l_n(\theta,x_0)\nabla_\theta l_n(\theta,x_0)^T\,|\,\overline Y_0,X_0=x_0]=-n^{-1}E_\theta[\nabla^2_\theta l_n(\theta,x_0)\,|\,\overline Y_0,X_0=x_0]$,

and Propositions 4 and 5 together with the Louis missing information principle show that the limits in $n$ of these two quantities both coincide with the Fisher information at $\theta^*$. We conclude the discussion in this section by stating the main result.

Theorem 3. Assume (A1)-(A3) and (A6)-(A8) and let $\{\theta^*_n\}$ be any, possibly stochastic, sequence in $\Theta$ such that $\theta^*_n\to\theta^*$ $P_{\theta^*}$-a.s. Then for all $x_0\in\mathcal X$, $-n^{-1}\nabla^2_\theta l_n(\theta^*_n,x_0)\to I(\theta^*)$, $P_{\theta^*}$-a.s.

The following theorem is a standard consequence of Theorems 2 and 3 (see, e.g., BRR).

Theorem 4. Assume (A1)-(A8) and that $I(\theta^*)$ is positive definite. Then for all $x_0\in\mathcal X$,

$n^{1/2}(\hat\theta_{n,x_0}-\theta^*)\to\mathcal N(0,I(\theta^*)^{-1})$, $P_{\theta^*}$-weakly.

7. Extensions to nonstationary AR models with Markov regime. In Sections 4 and 6 the assumption of stationarity of $\{Y_k\}$ plays a crucial role. In this section we shall extend the consistency and asymptotic normality of the MLE to the case where this process is not stationary. Hence we assume that the process we observe, denoted by $\{Y'_k\}_{k=0}^{\infty}$, and the associated hidden chain, denoted by $\{X'_k\}_{k=0}^{\infty}$, are governed by the transition kernel $\Pi_{\theta^*}$ and with $(X'_0,\overline Y'_0)$ having distribution $\zeta$. This initial distribution is unknown to us and in general $\zeta\ne\pi_{\theta^*}$. As before we let $\{(X_k,Y_k)\}_{k=0}^{\infty}$ denote a corresponding stationary process.

We observe that since these processes are positive Harris recurrent and aperiodic [this is (A2)] we can construct them on a common probability space in a way that there exists an a.s. finite random time $T$, the coupling time, such that $Z_n=Z'_n$ for $n\ge T$ [Thorisson (2000), page 369]. The associated probability measure is denoted by $P_{\pi_{\theta^*}\otimes\zeta}$. Hence, to be precise, $P_{\pi_{\theta^*}\otimes\zeta}(T<\infty)=1$.

Define $l'_n(\theta,x_0)=\log p_\theta([Y']_1^{n}|\overline Y'_0,X'_0=x_0)$ and let $\hat\theta'_{n,x_0}$ be the maximizer of this function w.r.t. $\theta$. Put

$D_n(\theta,x_0)=l'_n(\theta,x_0)-l_n(\theta,x_0)=\sum_{k=1}^{n}\bigl(\log p_\theta(Y'_k|[\overline Y']_0^{k-1},X'_0=x_0)-\log p_\theta(Y_k|\overline Y_0^{k-1},X_0=x_0)\bigr)$.

The following lemma ensures that $D_n(\theta,x_0)$ is bounded, $P_{\pi_{\theta^*}\otimes\zeta}$-a.s., which implies that the difference between $\hat\theta'_{n,x_0}$ and $\hat\theta_{n,x_0}$ converges to zero, $P_{\pi_{\theta^*}\otimes\zeta}$-a.s. (see Theorem 5).

Lemma 11. Assume (A1) and (A2). Then for all $\zeta$ and all $x_0\in\mathcal X$, $\sup_{n\ge0}\sup_{\theta\in\Theta}|D_n(\theta,x_0)|<\infty$, $P_{\pi_{\theta^*}\otimes\zeta}$-a.s.


Proof. Write

$\sup_{\theta\in\Theta}|D_n(\theta,x_0)|\le\sum_{k=1}^{n}\sup_{\theta\in\Theta}|\log p_\theta(Y'_k|[\overline Y']_0^{k-1},X'_0=x_0)-\log p_\theta(Y_k|\overline Y_0^{k-1},X_0=x_0)|$

(25) $\ \le\sum_{k=1}^{T}\Bigl(\sup_{\theta\in\Theta}|\log p_\theta(Y'_k|[\overline Y']_0^{k-1},X'_0=x_0)|+\sup_{\theta\in\Theta}|\log p_\theta(Y_k|\overline Y_0^{k-1},X_0=x_0)|\Bigr)+\sum_{k=T+1}^{\infty}\sup_{\theta\in\Theta}|\log p_\theta(Y'_k|[\overline Y']_0^{k-1},X'_0=x_0)-\log p_\theta(Y_k|\overline Y_0^{k-1},X_0=x_0)|$.

Since

$\sigma_-\int g_\theta(Y_k|\overline Y_{k-1},x)\,\mu(dx)\le p_\theta(Y_k|\overline Y_0^{k-1},X_0=x_0)\le\sigma_+\int g_\theta(Y_k|\overline Y_{k-1},x)\,\mu(dx)$

(see the proof of Lemma 2), the first sum on the right-hand side is finite $P_{\pi_{\theta^*}\otimes\zeta}$-a.s. by (A1).

For the second sum, note that for all $i<k$,

$p_\theta(Y_k|\overline Y_0^{k-1},X_0=x_0)=\iiint g_\theta(Y_k|\overline Y_{k-1},x_k)\,q_\theta(x_{k-1},x_k)\,\mu(dx_k)\,P_\theta(dx_{k-1}|x_i,\overline Y_i^{k-1})\times P_\theta(dx_i|\overline Y_0^{k-1},X_0=x_0)$,

and similarly for $p_\theta(Y'_k|[\overline Y']_0^{k-1},X'_0=x_0)$. Using the fact that for $n\ge T$, $Z_n=Z'_n$ and thus $Y_n=Y'_n$ and Corollary 1, we have for all $k>T$,

$|p_\theta(Y'_k|[\overline Y']_0^{k-1},X'_0=x_0)-p_\theta(Y_k|\overline Y_0^{k-1},X_0=x_0)|\le\rho^{k-T-1}\sigma_+\int g_\theta(Y_k|\overline Y_{k-1},x)\,\mu(dx)$,

and hence

(26) $|\log p_\theta(Y'_k|[\overline Y']_0^{k-1},X'_0=x_0)-\log p_\theta(Y_k|\overline Y_0^{k-1},X_0=x_0)|\le\rho^{k-T-1}/(1-\rho)$;

compare the proof of Lemma 2. Thus the second sum on the right-hand side of (25) is also finite $P_{\pi_{\theta^*}\otimes\zeta}$-a.s. □

We now can prove the consistency of the MLE for a nonstationary process.

Theorem 5. Assume (A1)-(A5). Then for all $\zeta$ and any $x_0\in\mathcal X$, $\lim_{n\to\infty}\hat\theta'_{n,x_0}=\theta^*$, $P_{\pi_{\theta^*}\otimes\zeta}$-a.s.

Proof. Since $\hat\theta'_{n,x_0}$ is the maximizer of $\theta\mapsto n^{-1}l'_n(\theta,x_0)$,

$l(\hat\theta'_{n,x_0})\ge l(\theta^*)-l(\theta^*)+n^{-1}l_n(\theta^*,x_0)-n^{-1}l_n(\theta^*,x_0)+n^{-1}l'_n(\theta^*,x_0)-n^{-1}l'_n(\hat\theta'_{n,x_0},x_0)+n^{-1}l_n(\hat\theta'_{n,x_0},x_0)-n^{-1}l_n(\hat\theta'_{n,x_0},x_0)+l(\hat\theta'_{n,x_0})$

$\ \ge l(\theta^*)-2\sup_{\theta\in\Theta}|n^{-1}l_n(\theta,x_0)-l(\theta)|-2\sup_{\theta\in\Theta}|n^{-1}D_n(\theta,x_0)|$.

The right-hand side of this inequality tends to $l(\theta^*)$, $P_{\pi_{\theta^*}\otimes\zeta}$-a.s., by Proposition 2 and Lemma 11. The proof now follows from Proposition 3, continuity of $l(\theta)$ (Lemma 4) and compactness of $\Theta$. □

To show that $n^{1/2}(\hat\theta'_{n,x_0}-\hat\theta_{n,x_0})\to0$, $P_{\pi_{\theta^*}\otimes\zeta}$-a.s. and thus that $\hat\theta'_{n,x_0}$ and $\hat\theta_{n,x_0}$ are asymptotically normal with the same covariance matrix, we need to show some kind of continuity of the function $\theta\mapsto D_n(\theta,x_0)$.

Lemma 12. Assume (A1)-(A5). Then

$\lim_{n\to\infty}|D_n(\hat\theta'_{n,x_0},x_0)-D_n(\hat\theta_{n,x_0},x_0)|=0$, $P_{\pi_{\theta^*}\otimes\zeta}$-a.s.

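Proof. Let $\varepsilon>0$. By (26) there exists a random integer $N$ which is finite $P_{\pi_{\theta^*}\otimes\zeta}$-a.s. and satisfies

$\sum_{k=N+1}^{\infty}\sup_{\theta\in\Theta}|\log p_\theta(Y'_k|[\overline Y']_0^{k-1},X'_0=x_0)-\log p_\theta(Y_k|\overline Y_0^{k-1},X_0=x_0)|\le\varepsilon$, $P_{\pi_{\theta^*}\otimes\zeta}$-a.s.

Thus, $P_{\pi_{\theta^*}\otimes\zeta}$-a.s. for all $n>N$,

$|D_n(\hat\theta'_{n,x_0},x_0)-D_n(\hat\theta_{n,x_0},x_0)|\le2\varepsilon+|l'_N(\hat\theta'_{n,x_0},x_0)-l'_N(\hat\theta_{n,x_0},x_0)|+|l_N(\hat\theta'_{n,x_0},x_0)-l_N(\hat\theta_{n,x_0},x_0)|$.

Under the given assumptions $\theta\mapsto l'_N(\theta,x_0)$ and $\theta\mapsto l_N(\theta,x_0)$ are $P_{\pi_{\theta^*}\otimes\zeta}$-a.s. continuous (see the proof of Lemma 4) and the proof is complete upon observing that $\hat\theta'_{n,x_0}$ and $\hat\theta_{n,x_0}$ both converge to $\theta^*$, $P_{\pi_{\theta^*}\otimes\zeta}$-a.s., and that $\varepsilon$ was arbitrary. □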

Theorem 6. Assume (A1)-(A8) and that $I(\theta^*)$ is positive definite. Then, for all $\zeta$ and any $x_0\in\mathcal X$,

$n^{1/2}(\hat\theta'_{n,x_0}-\theta^*)\to\mathcal N(0,I(\theta^*)^{-1})$, $P_{\pi_{\theta^*}\otimes\zeta}$-weakly.

Proof. It is sufficient to prove that $e_n=\sqrt n(\hat\theta_{n,x_0}-\hat\theta'_{n,x_0})\to0$, $P_{\pi_{\theta^*}\otimes\zeta}$-a.s. Since $\hat\theta'_{n,x_0}$ is the maximizer of $\theta\mapsto l'_n(\theta,x_0)$, $l'_n(\hat\theta'_{n,x_0},x_0)\ge l'_n(\hat\theta_{n,x_0},x_0)$, which implies that

$D_n(\hat\theta'_{n,x_0},x_0)-D_n(\hat\theta_{n,x_0},x_0)\ge l_n(\hat\theta_{n,x_0},x_0)-l_n(\hat\theta'_{n,x_0},x_0)=-\tfrac12e_n^T\,n^{-1}\nabla^2_\theta l_n(t_n\hat\theta'_{n,x_0}+(1-t_n)\hat\theta_{n,x_0},x_0)\,e_n$

for some $0\le t_n\le1$. By a straightforward adaptation of Theorem 3 to the present case with two processes,

$-n^{-1}\nabla^2_\theta l_n(t_n\hat\theta'_{n,x_0}+(1-t_n)\hat\theta_{n,x_0},x_0)\to I(\theta^*)$, $P_{\pi_{\theta^*}\otimes\zeta}$-a.s.

Since $I(\theta^*)$ is positive definite there exists $M>0$ such that on a set with $P_{\pi_{\theta^*}\otimes\zeta}$-probability one and for $n$ sufficiently large,

$D_n(\hat\theta'_{n,x_0},x_0)-D_n(\hat\theta_{n,x_0},x_0)\ge M|e_n|^2$.

The proof is complete by applying Lemma 12. □


8. Numerical approximations.

8.1. Two Monte Carlo numerical methods. As mentioned in the Introduction, when the state space of $\{X_k\}$ is continuous the log likelihood needs to be approximated by some numerical method. Here we list two classes of methods that have been proposed and successfully used in many practical problems, but point out that there are other ones as well, for example, importance sampling [Geyer and Thompson (1992) and Geyer (1994)].

Particle filters. These methods depart from the representation

$l_n(\theta,x_0)=\sum_{k=1}^{n}\log\int g_\theta(Y_k|\overline Y_{k-1},x_k)\,P_\theta(X_k\in dx_k|\overline Y_0^{k-1},X_0=x_0)$

and replace the predictive distribution $P_\theta(X_k\in dx_k|\overline Y_0^{k-1},X_0=x_0)$ by a particle approximation. More precisely, the approximating distribution is the empirical distribution of the locations of $N$ particles at time $k$. There are many variants to how the locations of the particles are updated, and under general assumptions the particle approximation converges to the true predictive distribution at rate $N^{-1/2}$ when $N$ grows. The approximate log likelihood may be maximized using any standard numerical optimization algorithm. Further reading is found in the collection Doucet, de Freitas and Gordon (2001); see in particular the survey paper Hürzeler and Künsch (2001). Other references are Künsch (2001) and Pitt (2002). Particle filter methods have been proved to perform well in a wide range of problems, as illustrated in the above references.
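The representation of $l_n(\theta,x_0)$ as a sum of log predictive integrals can be illustrated with a bootstrap particle filter. The following is a minimal sketch under stated assumptions, not the method of any of the cited references: the toy model (regime on $[0,1]$, transition kernel `step_regime` that does not depend on $\theta$, Gaussian AR(1) observation density `obs_density` with slope $\theta(2x-1)$, unit innovation variance) and all function names are hypothetical choices made only for illustration. Each predictive integral is replaced by an average of $g_\theta$ over propagated particles, followed by multinomial resampling.

```python
import math
import random


def step_regime(x, rng, eps=0.2, width=0.1):
    """Draw X_k given X_{k-1} = x on the compact state space [0, 1].

    Mixture of a uniform draw on [0, 1] (probability eps) and a local
    uniform move; the transition density is bounded away from 0 and infinity.
    """
    if rng.random() < eps:
        return rng.random()
    lo, hi = max(0.0, x - width), min(1.0, x + width)
    return rng.uniform(lo, hi)


def obs_density(y, y_prev, x, theta, sigma=1.0):
    """Gaussian density g_theta(y | y_prev, x) of an AR(1) with slope theta*(2x-1)."""
    mean = theta * (2.0 * x - 1.0) * y_prev
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def simulate(theta, n, x0, rng):
    """Simulate Y_0 = 0, Y_1, ..., Y_n from the toy model started at X_0 = x0."""
    x, ys = x0, [0.0]
    for _ in range(n):
        x = step_regime(x, rng)
        ys.append(theta * (2.0 * x - 1.0) * ys[-1] + rng.gauss(0.0, 1.0))
    return ys


def particle_loglik(theta, ys, x0, n_particles, rng):
    """Bootstrap particle approximation of l_n(theta, x0)."""
    particles = [x0] * n_particles
    loglik = 0.0
    for k in range(1, len(ys)):
        # Propagate: particles now approximate the predictive law of X_k.
        particles = [step_regime(x, rng) for x in particles]
        weights = [obs_density(ys[k], ys[k - 1], x, theta) for x in particles]
        # Monte Carlo estimate of the k-th predictive integral.
        loglik += math.log(sum(weights) / n_particles)
        # Multinomial resampling: particles now approximate the filter at time k.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return loglik


if __name__ == "__main__":
    data = simulate(0.8, 100, 0.5, random.Random(0))
    for theta in (0.0, 0.4, 0.8):
        print(theta, particle_loglik(theta, data, 0.5, 500, random.Random(1)))
```

Reusing the same random seed across values of $\theta$, as in the usage lines, is a common device to make the approximate log likelihood smoother in $\theta$ before handing it to an optimizer; whether it suffices in a given problem is model dependent.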
Monte Carlo EM algorithms. The EM algorithm is an iterative algorithm for computing the MLE (or at least a local maximum of the log likelihood) in problems with missing data. Its key components are the computation of the function $Q(\theta,\theta')=E_\theta[\log p_{\theta'}(X_1^{n},Y_1^{n}|\overline Y_0,X_0=x_0)\,|\,\overline Y_0^{n},X_0=x_0]$ (the E-step) and the maximization of this function w.r.t. $\theta'$ (the M-step). These two steps constitute the update from a current estimate $\theta$ to a new one. Obviously the EM user is required to compute conditional expectations of functions of $X_1^{n}$ given $\overline Y_0^{n}$ and $X_0=x_0$. If the state space is continuous this task is typically infeasible, but the conditional expectations can be replaced by sample averages over $m$ simulated realizations of $X_1^{n}$ under the same conditions. These methods are called Monte Carlo EM (MCEM) algorithms, or stochastic EM (SEM) algorithms. A recent survey is found in Booth, Hobert and Jank (2001), and general versions of the algorithm are described in Tanner (1996) and Nielsen (2000). If the number $m$ of simulated replications is allowed to increase with each iteration, the algorithm can be made to converge [Fort and Moulines (2003)]. MCEM methods are successfully used in many areas; see the above-mentioned survey paper.
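To make the E-step/M-step structure concrete, here is a deliberately simple MCEM sketch. It is not a Markov-switching model: as an assumption made purely for illustration, the hidden labels are i.i.d. with two equally likely states and the observations are Gaussian with unit variance and state-dependent mean, so that the conditional law of each label can be simulated exactly. The function name `mcem_two_means` and the schedule $m_t=5t$ for the number of simulated replications are hypothetical choices; in the switching autoregression the simulation step would instead require MCMC or particle methods.

```python
import math
import random


def mcem_two_means(ys, mu_init, n_iter, rng):
    """Monte Carlo EM for the means (mu_0, mu_1) of a two-component mixture.

    Toy assumptions: equal mixing weights, unit variances, i.i.d. hidden labels.
    """
    mu = list(mu_init)
    for t in range(1, n_iter + 1):
        m = 5 * t  # number of simulated replications grows with the iteration
        sums, counts = [0.0, 0.0], [0, 0]
        for y in ys:
            # Conditional probability that the hidden label equals 1 given y.
            a0 = math.exp(-0.5 * (y - mu[0]) ** 2)
            a1 = math.exp(-0.5 * (y - mu[1]) ** 2)
            p1 = a1 / (a0 + a1)
            # Monte Carlo E-step: simulate m labels from their conditional law.
            for _ in range(m):
                j = 1 if rng.random() < p1 else 0
                sums[j] += y
                counts[j] += 1
        # M-step: maximize the simulated complete-data log likelihood,
        # which here reduces to per-state sample means.
        mu = [sums[j] / counts[j] if counts[j] else mu[j] for j in range(2)]
    return mu


if __name__ == "__main__":
    gen = random.Random(0)
    data = [gen.gauss(-2.0 if gen.random() < 0.5 else 2.0, 1.0) for _ in range(300)]
    print(mcem_two_means(data, (-1.0, 1.0), 15, random.Random(1)))
```

The growing schedule for `m` mirrors the remark in the text that letting the number of replications increase with the iteration is what makes convergence possible; with a fixed `m` the iterates would keep fluctuating around the maximizer.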

Having said that, we stress that the distinction between particle filter and MCEM methods is not sharp. In fact, the function $Q(\theta,\theta')$ of the EM algorithm can, in principle, be computed recursively in $n$, which opens up for particle approximations of this functional [Cappé (2001)]. Hence, the approximation and maximization of the log likelihood rather splits into two other subproblems to be considered. First, the optimization scheme: (i) EM type, which is particularly appropriate if the complete data is from an exponential or curved exponential family of distributions, or (ii) a standard numerical optimization algorithm such as a quasi-Newton or conjugate gradient method. Second, the approach to approximate conditional expectations: (i) forward in time using particle filters or (ii) conditional on the whole set of data using more traditional MCMC simulation.

8.2. Asymptotics of approximate estimators. Theorems 1 and 4 give the asymptotic properties of the MLE, but, as noted above, neither the (conditional) likelihood nor the MLE is computable unless the state space is finite. An important question is thus if an approximate computation of the MLE or likelihood is sufficient to retain the asymptotics. Of course, if $\tilde\theta_{n,x_0}$ is an estimator such that $\tilde\theta_{n,x_0}-\hat\theta_{n,x_0}=o_P(n^{-1/2})$ (with $P=P_{\theta^*}$), then $\tilde\theta_{n,x_0}$ is consistent and $n^{1/2}(\tilde\theta_{n,x_0}-\theta^*)$ has the same distributional limit as $n^{1/2}(\hat\theta_{n,x_0}-\theta^*)$. This simple observation applies to methods that directly approximate the MLE, for example, MCEM. The following theorem gives a corresponding result when the likelihood is approximated.

Theorem 7. Assume that $\tilde\theta_{n,x_0}$ is an estimator satisfying $l_n(\tilde\theta_{n,x_0},x_0)\ge\sup_{\theta\in\Theta}l_n(\theta,x_0)-R_n$ and that the assumptions of Theorem 4 hold. Then the following are true:

(i) If $R_n=o_P(n)$ (with $P=P_{\theta^*}$), then $\tilde\theta_{n,x_0}$ is consistent.

(ii) If $R_n=O_P(1)$, then $n^{1/2}(\tilde\theta_{n,x_0}-\theta^*)=O_P(1)$, that is, the sequence $\{\tilde\theta_{n,x_0}\}$ is $n^{1/2}$-consistent under $P_{\theta^*}$.

(iii) If $R_n=o_P(1)$, then $n^{1/2}(\tilde\theta_{n,x_0}-\theta^*)\to\mathcal N(0,I(\theta^*)^{-1})$, $P_{\theta^*}$-weakly as $n\to\infty$.

Remark 7. The remainder term does not depend on $\theta$, that is, it is
uniform in $\theta \in \Theta$. If $n^{-1} R_n \to 0$, $\mathbb{P}_{\theta^*}$-a.s.
in (i), we obtain strong consistency.

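Proof of Theorem 7. We start with (i). Since
$l_n(\tilde\theta_{n,x_0}, x_0) \ge \sup_{\theta\in\Theta} l_n(\theta, x_0) - R_n \ge l_n(\theta^*, x_0) - R_n$,
we have

$$
\begin{aligned}
l(\theta^*) &\ge l(\tilde\theta_{n,x_0}) \\
&\ge l(\theta^*) - l(\theta^*) + n^{-1} l_n(\theta^*, x_0) - n^{-1} l_n(\tilde\theta_{n,x_0}, x_0) + l(\tilde\theta_{n,x_0}) - n^{-1} R_n \\
&\ge l(\theta^*) - 2 \sup_{\theta\in\Theta} |n^{-1} l_n(\theta, x_0) - l(\theta)| - n^{-1} R_n .
\end{aligned}
$$

If $R_n = o_P(n)$, using Proposition 2,
$l(\tilde\theta_{n,x_0}) - l(\theta^*) = o_P(1)$. Standard compactness arguments
going back to Wald (1949) and Proposition 3 complete the proof of (i).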

We now turn to (ii) and (iii). Recall that $\hat\theta_{n,x_0}$ maximizes
$l_n(\theta, x_0)$. By a Taylor expansion of $l_n(\theta, x_0)$ around
$\hat\theta_{n,x_0}$, there exists a point $\bar\theta_n$ on the line segment
between $\tilde\theta_{n,x_0}$ and $\hat\theta_{n,x_0}$ such that

$$
R_n \ge l_n(\hat\theta_{n,x_0}, x_0) - l_n(\tilde\theta_{n,x_0}, x_0)
= e_n^T \bigl( -\tfrac{1}{2} n^{-1} \nabla_\theta^2 l_n(\bar\theta_n, x_0) \bigr) e_n ,
$$

where $e_n = n^{1/2}(\tilde\theta_{n,x_0} - \hat\theta_{n,x_0})$. Since
$\tilde\theta_{n,x_0}$ converges to $\theta^*$ in probability, so does
$\bar\theta_n$. Hence there is a positive sequence $\{\delta_n\}$ tending to
zero such that $\mathbb{P}_{\theta^*}(|\bar\theta_n - \theta^*| > \delta_n) \to 0$.
Thus, for any $c > 0$,

$$
\begin{aligned}
&\mathbb{P}_{\theta^*}\bigl( \| -n^{-1} \nabla_\theta^2 l_n(\bar\theta_n, x_0) - I(\theta^*) \| > c \bigr) \\
&\quad \le \mathbb{P}_{\theta^*}(|\bar\theta_n - \theta^*| > \delta_n)
+ \mathbb{P}_{\theta^*}\Bigl( \sup_{|\theta - \theta^*| \le \delta_n} \| -n^{-1} \nabla_\theta^2 l_n(\theta, x_0) - I(\theta^*) \| > c \Bigr).
\end{aligned}
$$

The first term on the right-hand side tends to zero as $n \to \infty$, and so
does the second one by Theorem 3. Since $I(\theta^*)$ is assumed positive
definite, there exists an $M > 0$ such that

$$
R_n \ge (M + o_P(1)) |e_n|^2 .
$$

Thus, if $R_n = O_P(1)$, then $e_n = O_P(1)$, and if $R_n = o_P(1)$, then
$e_n = o_P(1)$. The proofs of (ii) and (iii) are now complete using
$n^{1/2}(\tilde\theta_{n,x_0} - \theta^*) = e_n + n^{1/2}(\hat\theta_{n,x_0} - \theta^*)$
and the result of Theorem 4. □

Obviously, if $\tilde l_n$ is an approximation of the true log likelihood $l_n$
(both conditional on $x_0$) such that
$|\tilde l_n(\theta, x_0) - l_n(\theta, x_0)| \le (1/2) R_n$ for all
$\theta \in \Theta$, and $\tilde\theta_{n,x_0}$ is the corresponding maximizer,
then
$l_n(\tilde\theta_{n,x_0}, x_0) \ge \tilde l_n(\tilde\theta_{n,x_0}, x_0) - (1/2) R_n \ge \tilde l_n(\hat\theta_{n,x_0}, x_0) - (1/2) R_n \ge l_n(\hat\theta_{n,x_0}, x_0) - R_n$,
that is, the principal condition of the theorem is fulfilled. We thus see that
what is required is to approximate the true log likelihood uniformly, and that
increased accuracy of the approximation yields improved properties of the
resulting approximate MLE. Because $l_n(\theta, x_0)$ is continuous in
$\theta$, uniform convergence on compacts is in our case implied by the
combination of so-called epiconvergence and hypoconvergence of an
approximation $\tilde l_n(\theta, x_0)$ [see Geyer (1994), page 273].
Moreover, Geyer also proved that both of these modes of convergence can be
obtained by an importance sampling approach, in which the unobserved states
are simulated using MCMC under a fixed reference parameter [Geyer (1994),
Theorem 2]. Of course, to obtain the required rate of convergence of the
approximation, with increasing $n$ an increasing number of importance samples
must be taken.
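The sandwich argument above can be checked on a toy problem. The sketch below is only an illustration under stated assumptions: `loglik` is a hypothetical concave function standing in for the true log likelihood (it is not the likelihood of the Markov-switching model), the parameter set is a grid on [0, 1], and `make_approximation` is a made-up helper that perturbs `loglik` by at most `R / 2` at every grid point.

```python
import random


def loglik(theta):
    # Hypothetical stand-in for the true log likelihood l_n(theta, x0).
    return -50.0 * (theta - 0.3) ** 2


def make_approximation(R, seed=0):
    # Build an approximation whose error is at most R/2 at every point.
    rng = random.Random(seed)
    cache = {}

    def approx(theta):
        if theta not in cache:
            cache[theta] = loglik(theta) + rng.uniform(-R / 2, R / 2)
        return cache[theta]

    return approx


grid = [i / 1000 for i in range(1001)]  # compact parameter set [0, 1]
R = 0.5
approx_loglik = make_approximation(R)

theta_hat = max(grid, key=loglik)           # exact maximizer on the grid
theta_tilde = max(grid, key=approx_loglik)  # maximizer of the approximation

# Loss in true log likelihood incurred by maximizing the approximation.
gap = loglik(theta_hat) - loglik(theta_tilde)
print(theta_hat, theta_tilde, gap)
```

Whatever the random perturbation, `gap` cannot exceed `R`, which is exactly the principal condition of Theorem 7 with remainder `R`.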

Approximation of the log likelihood using particle filters is described, for 
instance, in the above-mentioned paper by Pitt (2002), who also devised a 
method to smooth the approximation to a continuous function; this method 
works for univariate state variables only, however. At present we know of 
no formal proofs that particle filters approximate the true log likelihood 
uniformly, but strongly conjecture that they do under general assumptions. 

8.3. A numerical example. We now turn to a specific numerical example,
in which we shall employ an MCEM algorithm. Localization and tracking
of narrowband moving sources by a passive array is one of the fundamental
problems in radar, communication and sonar [see Ng, Larocque and Reilly
(2001), Orton and Fitzgerald (2002) and references therein]. This problem
can be stated as follows. Consider a uniform linear array of $d$ sensors
receiving a narrowband signal from a far-field source with unknown
time-varying direction of arrival (DOA). Under the classical narrowband array
processing model the received signal at time $k$, the $d \times 1$ array
observation vector $Y_k$, can be expressed as

$$ W_k = W_{k-1} + \eta_k, \tag{27} $$

$$ Y_k = S_k a(W_k) + \varepsilon_k, \tag{28} $$

where $a(w) = [1 \; e^{iw} \; \cdots \; e^{i(d-1)w}]^T$ is the $d \times 1$
steering vector, $S_k$ is the source waveform, $\eta_k$ is the state noise and
$\varepsilon_k$ is the measurement noise. It is assumed that (i) $\{\eta_k\}$
are i.i.d. zero mean Gaussian with variance $\sigma_\eta^2$, (ii) $\{S_k\}$ are
i.i.d. zero mean one-dimensional complex circular Gaussian, that is,
$E S_k = 0$ and $E|S_k|^2 = \sigma_s^2$ and (iii) $\{\varepsilon_k\}$ are i.i.d.
zero mean $d$-dimensional complex circular Gaussian, that is,
$E \varepsilon_k = 0$ and $E \varepsilon_k \varepsilon_k^H = \sigma_\varepsilon^2 I_d$,
where $x^H$ is the conjugate transpose (or Hermitian transpose) of $x$ and
$I_d$ is the $d \times d$ identity matrix. This is a hidden Markov model, or
state space model, as there is no autoregression in the $Y$'s. We wish to
estimate the parameter $\theta = (\sigma_\eta^2, \sigma_s^2, \sigma_\varepsilon^2)$
from the observed data $Y_1, \ldots, Y_n$.
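To make the data-generating mechanism (27)-(28) concrete, the following is a minimal simulation sketch using only the Python standard library. The function names (`steering`, `complex_gauss`, `simulate`), the starting angle `w0` and the seed are assumptions made for illustration; the parameter values are simply plausible inputs and carry no special meaning here.

```python
import cmath
import math
import random


def steering(w, d):
    # Steering vector a(w) = [1, e^{iw}, ..., e^{i(d-1)w}]^T.
    return [cmath.exp(1j * m * w) for m in range(d)]


def complex_gauss(rng, var):
    # Circular complex Gaussian Z with E Z = 0 and E|Z|^2 = var.
    s = math.sqrt(var / 2.0)
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))


def simulate(n, d, var_eta, var_s, var_eps, w0=math.pi, seed=1):
    rng = random.Random(seed)
    w, ws, ys = w0, [], []
    for _ in range(n):
        w = w + rng.gauss(0.0, math.sqrt(var_eta))  # state equation (27)
        s = complex_gauss(rng, var_s)               # source waveform S_k
        a = steering(w, d)
        # observation equation (28), one entry per sensor
        y = [s * a_m + complex_gauss(rng, var_eps) for a_m in a]
        ws.append(w)
        ys.append(y)
    return ws, ys


ws, ys = simulate(n=200, d=4, var_eta=0.25, var_s=0.64, var_eps=0.36)
# Each sensor output has second moment var_s + var_eps, so the empirical
# mean power should be in the vicinity of 1.0 for these inputs.
mean_power = sum(abs(c) ** 2 for y in ys for c in y) / (200 * 4)
print(mean_power)
```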

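Conditionally on the hidden variable $W_k$, $Y_k$ is a Gaussian complex vector
with density $g_\theta(y_k | W_k)$, where

$$
g_\theta(y | w) = \frac{1}{\pi^d \det \Sigma(w)} \exp\{ -y^H \Sigma^{-1}(w) y \}
$$

with

$$
\Sigma(w) = E[Y_k Y_k^H | W_k = w] = \sigma_s^2 a(w) a(w)^H + \sigma_\varepsilon^2 I_d .
$$

It is easily checked that

$$
\Sigma^{-1}(w) = -\frac{\sigma_s^2}{\sigma_\varepsilon^2 (d \sigma_s^2 + \sigma_\varepsilon^2)} a(w) a(w)^H + \frac{1}{\sigma_\varepsilon^2} I_d
$$

and

$$
\log g_\theta(y | w) = -d \log \pi - \log\bigl( \sigma_\varepsilon^{2(d-1)} (d \sigma_s^2 + \sigma_\varepsilon^2) \bigr)
- \frac{1}{\sigma_\varepsilon^2} y^H y
+ \frac{\sigma_s^2}{\sigma_\varepsilon^2 (d \sigma_s^2 + \sigma_\varepsilon^2)} |a(w)^H y|^2 .
$$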
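The closed-form inverse of $\Sigma(w)$ (a rank-one update of a multiple of the identity, with $a(w)^H a(w) = d$) and the resulting log density are easy to sanity-check numerically. The sketch below uses only the standard library; the helper names and the test vector `y` are arbitrary choices made for illustration, not quantities from the paper.

```python
import cmath
import math


def steering(w, d):
    return [cmath.exp(1j * m * w) for m in range(d)]


def inner(a, v):
    # Hermitian inner product a^H v.
    return sum(x.conjugate() * y for x, y in zip(a, v))


def sigma_times(v, a, var_s, var_eps):
    # Sigma(w) v = var_s * a (a^H v) + var_eps * v.
    c = inner(a, v)
    return [var_s * x * c + var_eps * y for x, y in zip(a, v)]


def sigma_inv_times(v, a, var_s, var_eps):
    # Closed-form inverse applied to v.
    d = len(a)
    coef = var_s / (var_eps * (d * var_s + var_eps))
    c = inner(a, v)
    return [-coef * x * c + y / var_eps for x, y in zip(a, v)]


def log_g(y, w, var_s, var_eps):
    # Closed-form log density of y given the angle w.
    d = len(y)
    a = steering(w, d)
    coef = var_s / (var_eps * (d * var_s + var_eps))
    return (-d * math.log(math.pi)
            - math.log(var_eps ** (d - 1) * (d * var_s + var_eps))
            - inner(y, y).real / var_eps
            + coef * abs(inner(a, y)) ** 2)


a = steering(0.7, 4)
y = [1 + 2j, -0.5j, 0.3 + 0.1j, -1.2 + 0.4j]

# Sigma * (Sigma^{-1} y) should give back y.
roundtrip = sigma_times(sigma_inv_times(y, a, 0.64, 0.36), a, 0.64, 0.36)
max_err = max(abs(u - v) for u, v in zip(roundtrip, y))

# The log density should agree with -d log(pi) - log det - y^H Sigma^{-1} y.
quad = inner(y, sigma_inv_times(y, a, 0.64, 0.36)).real
direct = -4 * math.log(math.pi) - math.log(0.36 ** 3 * (4 * 0.64 + 0.36)) - quad
print(max_err, log_g(y, 0.7, 0.64, 0.36), direct)
```

Working with matrix-vector products rather than explicit matrices keeps the check free of third-party dependencies.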


Furthermore, with $r_\theta$ denoting the transition density of $\{W_k\}$,

$$
\log r_\theta(w, w') = \log r_{\sigma_\eta^2}(w, w') = -\frac{1}{2} \log(2\pi\sigma_\eta^2) - \frac{1}{2\sigma_\eta^2} (w' - w)^2 .
$$


The above model is equivalent to an HMM on a compact state space.
Indeed, identify the interval $[0, 2\pi)$ with the unit circle, which is a
compact set, and put $X_k = W_k \bmod 2\pi$. It is then clear that $\{X_k\}$ is
a Markov chain on $[0, 2\pi)$, with transition density
$q_{\sigma_\eta^2}(x, x') = \sum_{\ell=-\infty}^{\infty} r_{\sigma_\eta^2}(x, x' + 2\pi\ell)$.
The output density stays the same, that is, the conditional density of $Y_k$
given $X_k = x$ is $g_\theta(y | x)$. It is easily verified that the HMM
$\{(X_k, Y_k)\}$ satisfies the regularity conditions in the previous sections.

Let $\theta$ and $\theta'$ denote two (potentially) different parameter values.
The EM algorithm involves iterative maximization of the function
$Q(\theta, \theta') = E_\theta[\log p_{\theta'}(X_1^n, Y_1^n | X_0 = x_0) | Y_1^n, X_0 = x_0]$.
Specifically, if $\theta_p$ is the result of the $p$th iteration, then
$\theta_{p+1}$ is the maximizer (in $\theta'$) of $Q(\theta_p, \theta')$, that
is, $\theta_{p+1} = \arg\max_{\theta'} Q(\theta_p, \theta')$. For the present
model, put
$\beta(\theta) = \sum_{k=1}^n E_\theta[ |a(X_k)^H Y_k|^2 | Y_1^n, X_0 = x_0 ]$.
It is then straightforward to verify that the maximizer of the M-step of
the EM algorithm is the triple
$(\hat\sigma_\eta^2, \hat\sigma_s^2, \hat\sigma_\varepsilon^2)$ given by

$$
\hat\sigma_\eta^2 = \arg\max_v E_{\theta_p}\Bigl[ \sum_{k=1}^n \log q_v(X_{k-1}, X_k) \Bigm| Y_1^n, X_0 = x_0 \Bigr], \tag{29}
$$

$$
\hat\sigma_s^2 = \frac{\beta(\theta_p) - \sum_{k=1}^n |Y_k|^2}{n d (d-1)}, \tag{30}
$$

$$
\hat\sigma_\varepsilon^2 = \frac{\sum_{k=1}^n |Y_k|^2 - \beta(\theta_p)/d}{n (d-1)}. \tag{31}
$$
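The closed forms for the source and noise variances in the M-step can be checked against a direct evaluation of the part of $Q$ that depends on them. The sketch below is illustrative only: `beta` and `sum_sq` (standing for $\beta(\theta_p)$ and $\sum_k |Y_k|^2$) are hypothetical inputs, chosen here so that the formulas return 0.64 and 0.36, and `q_part` is an assumed name for the relevant terms of $Q$, up to an additive constant, obtained from the expression for $\log g_\theta(y|w)$.

```python
import math


def m_step_signal_noise(beta, sum_sq, n, d):
    # Closed-form updates for the source and noise variances.
    var_s = (beta - sum_sq) / (n * d * (d - 1))
    var_eps = (sum_sq - beta / d) / (n * (d - 1))
    return var_s, var_eps


def q_part(var_s, var_eps, beta, sum_sq, n, d):
    # Terms of Q(theta_p, theta') involving (var_s, var_eps), up to constants.
    lam = d * var_s + var_eps
    return (-n * (d - 1) * math.log(var_eps) - n * math.log(lam)
            - sum_sq / var_eps + var_s / (var_eps * lam) * beta)


n, d = 200, 4
sum_sq = 800.0   # hypothetical value of sum_k |Y_k|^2
beta = 2336.0    # hypothetical value of beta(theta_p)

var_s, var_eps = m_step_signal_noise(beta, sum_sq, n, d)
best = q_part(var_s, var_eps, beta, sum_sq, n, d)

# Compare with nearby parameter values: none should do better.
nearby = [q_part(var_s + ds, var_eps + de, beta, sum_sq, n, d)
          for ds in (-0.05, 0.0, 0.05) for de in (-0.05, 0.0, 0.05)]
print(var_s, var_eps, best >= max(nearby))
```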

The conditional expectation $\beta(\theta)$ cannot be explicitly computed, let
alone the expectation required to compute $\hat\sigma_\eta^2$. We note that we
could also employ the representation with $\{W_k\}$ to simplify the
implementation of this part of the M-step and the MCMC algorithm below, as
there is then a sufficient statistic for the re-estimation of $\sigma_\eta^2$
as well, but this approach gives us less satisfying numerical results. We also
note that although $q_\theta(x, x')$ is not available in closed form, it is
straightforward to approximate it by a truncated sum as
$r_{\sigma_\eta^2}(x, x' + 2\pi\ell)$ decays rapidly as $|\ell| \to \infty$.
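A truncated-sum approximation of the wrapped transition density might look as follows. This is a sketch under assumptions: the truncation level `L`, the function names and the Riemann-sum check are illustrative choices, not taken from the paper.

```python
import math

TWO_PI = 2.0 * math.pi


def r_density(w, w_next, var_eta):
    # Gaussian random-walk transition density of {W_k}.
    return (math.exp(-(w_next - w) ** 2 / (2.0 * var_eta))
            / math.sqrt(TWO_PI * var_eta))


def q_density(x, x_next, var_eta, L=5):
    # Wrapped density on [0, 2*pi): sum of r over shifts by 2*pi*l, |l| <= L.
    return sum(r_density(x, x_next + TWO_PI * l, var_eta)
               for l in range(-L, L + 1))


# Riemann-sum check that q(x, .) integrates to about one over [0, 2*pi).
m = 2000
h = TWO_PI / m
total = sum(q_density(1.0, j * h, 0.25) for j in range(m)) * h

# Effect of the truncation level at a single point.
trunc_diff = abs(q_density(1.0, 4.0, 0.25, L=1) - q_density(1.0, 4.0, 0.25, L=5))
print(total, trunc_diff)
```

For a variance of this size a very small `L` already suffices, which reflects the rapid decay of the Gaussian terms.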

In the MCEM approach, the conditional expectations above are replaced
by sample means over a number of realizations of $X_1^n$, conditional on
$Y_1^n$ and $X_0 = x_0$, obtained by Monte Carlo simulation. At each iteration
$p$ we draw a sample of size $m_p$ of an $\mathbb{R}^n$-valued Markov chain with
stationary distribution
$\mathbb{P}_{\theta_p}(X_1^n \in \cdot \,|\, Y_1^n, X_0 = x_0)$. Many different
solutions are available at this stage; in the simulations below, we use a
random scan Metropolis-Hastings algorithm with transition kernel from the
current state $x$ to the next state $x'$ defined in the following way:

1. Choose a time index $i$ uniformly on $\{1, \ldots, n\}$.

2. Simulate $x_i'' \sim q_{\theta_p}(x_{i-1}, \cdot)$.

3. Set $x' = x$ (these are $\mathbb{R}^n$-valued) and update the $i$th
component of $x'$, that is, $x_i'$, to $x_i''$ with probability

$$
1 \wedge \frac{q_{\theta_p}(x_{i-1}, x_i'') \, q_{\theta_p}(x_i'', x_{i+1}) \, g_{\theta_p}(Y_i | x_i'')}{q_{\theta_p}(x_{i-1}, x_i) \, q_{\theta_p}(x_i, x_{i+1}) \, g_{\theta_p}(Y_i | x_i)} \cdot \frac{q_{\theta_p}(x_{i-1}, x_i)}{q_{\theta_p}(x_{i-1}, x_i'')}
= 1 \wedge \frac{q_{\theta_p}(x_i'', x_{i+1}) \, g_{\theta_p}(Y_i | x_i'')}{q_{\theta_p}(x_i, x_{i+1}) \, g_{\theta_p}(Y_i | x_i)} .
$$

If $i = n$, this acceptance probability is modified to

$$
1 \wedge \frac{q_{\theta_p}(x_{i-1}, x_i'') \, g_{\theta_p}(Y_i | x_i'')}{q_{\theta_p}(x_{i-1}, x_i) \, g_{\theta_p}(Y_i | x_i)} \cdot \frac{q_{\theta_p}(x_{i-1}, x_i)}{q_{\theta_p}(x_{i-1}, x_i'')}
= 1 \wedge \frac{g_{\theta_p}(Y_i | x_i'')}{g_{\theta_p}(Y_i | x_i)} .
$$
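The three steps above can be sketched in code as follows. This is a minimal illustration, not the implementation used for the reported simulations: the function names, the truncation level `L`, the synthetic data, the path length and the number of updates are all assumptions. Indices are 0-based, so index `i` corresponds to time $i+1$, and only the part of $\log g_\theta$ that depends on the hidden state is evaluated because the remaining terms cancel in the acceptance ratio.

```python
import cmath
import math
import random

TWO_PI = 2.0 * math.pi


def log_q(x, x_next, var_eta, L=3):
    # Log of the wrapped-normal transition density, truncated to |l| <= L.
    s = sum(math.exp(-(x_next + TWO_PI * l - x) ** 2 / (2.0 * var_eta))
            for l in range(-L, L + 1))
    return math.log(max(s, 1e-300)) - 0.5 * math.log(TWO_PI * var_eta)


def log_g_x(y, x, var_s, var_eps):
    # State-dependent part of log g(y | x): coef * |a(x)^H y|^2.
    d = len(y)
    proj = sum(cmath.exp(-1j * m * x) * y_m for m, y_m in enumerate(y))
    return var_s / (var_eps * (d * var_s + var_eps)) * abs(proj) ** 2


def mh_update(x, x0, ys, theta, rng):
    """One random scan update of the path x (a list), performed in place."""
    var_eta, var_s, var_eps = theta
    n = len(x)
    i = rng.randrange(n)                                   # step 1
    prev = x0 if i == 0 else x[i - 1]
    prop = (prev + rng.gauss(0.0, math.sqrt(var_eta))) % TWO_PI  # step 2
    log_ratio = (log_g_x(ys[i], prop, var_s, var_eps)
                 - log_g_x(ys[i], x[i], var_s, var_eps))
    if i < n - 1:  # at the last time point there is no successor factor
        log_ratio += (log_q(prop, x[i + 1], var_eta)
                      - log_q(x[i], x[i + 1], var_eta))
    if rng.random() < math.exp(min(0.0, log_ratio)):       # step 3
        x[i] = prop
        return True
    return False


# Synthetic data from the model, with hypothetical settings.
rng = random.Random(42)
theta = (0.25, 0.64, 0.36)
n, d, x0 = 30, 4, math.pi
ys, w = [], x0
for _ in range(n):
    w += rng.gauss(0.0, math.sqrt(theta[0]))
    s = complex(rng.gauss(0, math.sqrt(theta[1] / 2)),
                rng.gauss(0, math.sqrt(theta[1] / 2)))
    ys.append([s * cmath.exp(1j * m * w)
               + complex(rng.gauss(0, math.sqrt(theta[2] / 2)),
                         rng.gauss(0, math.sqrt(theta[2] / 2)))
               for m in range(d)])

path = [x0] * n
accepted = sum(mh_update(path, x0, ys, theta, rng) for _ in range(5000))
acceptance_rate = accepted / 5000
print(acceptance_rate)
```

Proposing from the prior transition $q_{\theta_p}(x_{i-1}, \cdot)$ is what makes the two factors involving $x_{i-1}$ cancel, leaving only the successor transition and the output density in the ratio.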

To guarantee convergence of the algorithm, the number of samples, $m_p$,
should either be increased at each iteration or be selected in a data-driven
manner at each iteration [see Booth and Hobert (1999) or Booth, Hobert and
Jank (2001)]. For simplicity we did not implement such mechanisms but rather
used a fixed large number of iterations at each step of the algorithm.

We simulated a single sample of size $n = 200$ from the model (27)-(28)
with $d = 4$ and with the true value of the parameter
$\theta = (\sigma_\eta^2, \sigma_s^2, \sigma_\varepsilon^2)$ being
$\theta^* = (0.25, 0.64, 0.36)$. At each step of the MCEM procedure we
generated a sample of size 40,000 by the random scan Metropolis-Hastings
algorithm, after a burn-in of 20,000 iterations. The acceptance rate of the
algorithm was about 40%. The re-estimation of $\sigma_\eta^2$ as in (29) was
carried out by numerical optimization, and, in order to save computation time,
of the total of 40,000 replications only every 40th was used for the
corresponding sample average (i.e., 1,000 replications). The stationary
distribution of $\{X_k\}$ is the uniform distribution on $[0, 2\pi)$, whence we
fixed the initial state $x_0$ to its mean $\pi$. We remark that in this
particular case the stationary distribution does not depend on $\theta$,
whence it could have been employed in the algorithm. We started the MCEM
algorithm from the true parameters as well as from four randomly chosen
initial points for which each $\sigma^2$-parameter was drawn independently
from a uniform distribution on $(0, 1)$. For each of the five initial points
we ran the algorithm for 50 iterations. Figure 1 shows the trajectories for
each initial point and parameter. Obviously, irrespective of the initial point
the algorithm quickly finds the same approximation to the MLE, although the
trajectories do not converge as the sample size $m_p$ in the algorithm stays
bounded. The trajectories for $\sigma_\eta^2$ fluctuate a little more since,
as described above, only 1,000 replications were used for its re-estimation.

Next we estimated the observed information, that is, the negative Hessian
of the log likelihood. Our starting point was the missing information
principle (22), and we again replaced the expectations involved by sample
means over simulated replications of $X_1^n$ given $Y_1^n$ and $X_0 = x_0$
obtained in the same way as above. Our approximation to the MLE,
$\hat\theta$ say, used for these computations was taken as the sample mean of
the last 25 values of the trajectory obtained for the second randomly chosen
starting point mentioned above; it was
$\hat\theta = (0.2793, 0.5756, 0.3466)$. After running the
Metropolis-Hastings algorithm for a burn-in of 100,000 iterations we used
another 200,000 iterations for the sample means. The resulting approximation
of the observed information and its inverse were

$$
I = \begin{pmatrix} 203.2 & -3.908 & 56.10 \\ -3.908 & 449.1 & 177.7 \\ 56.10 & 177.7 & 4169 \end{pmatrix},
\qquad
I^{-1} = 10^{-3} \begin{pmatrix} 4.941 & 0.070 & -0.070 \\ 0.070 & 2.266 & -0.098 \\ -0.070 & -0.098 & 0.245 \end{pmatrix}.
$$

The corresponding approximate 95% confidence intervals are (0.1416, 0.4171),
(0.4823, 0.6689) and (0.3159, 0.3773) for $\sigma_\eta^2$, $\sigma_s^2$ and
$\sigma_\varepsilon^2$, respectively, and we see that they all cover the
respective true values. We see that the variations in the MCEM estimates in
Figure 1 are considerably smaller than the widths of the confidence intervals,
which indicates that the MLE is well approximated and hence that the inverse
observed information matrix is a good estimate of the covariance matrix of the
approximate MLE as well. Obviously the widest interval is that for
$\sigma_\eta^2$, which is not surprising, as this parameter is associated with
the hidden state alone and hence, loosely speaking, "less observable" than the
other ones. A simultaneous test for $H_0: \theta = \theta^*$ can be carried
out by computing the test statistic
$\chi^2 = (\hat\theta - \theta^*)^T I (\hat\theta - \theta^*)$, which
approximately has a $\chi^2$ distribution with 3 degrees of freedom under the
null hypothesis. We found $\chi^2 = 3.065$ and the corresponding $p$-value is
0.38. The null hypothesis could thus not be rejected.
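The interval and test computations can be retraced from the rounded figures reported above. The sketch below uses only the standard library; it assumes the usual normal quantile 1.96 for the intervals and uses the closed-form survival function of the chi-square distribution with 3 degrees of freedom. Because the inputs are rounded, the recomputed values agree with the reported ones only up to rounding.

```python
import math

theta_hat = [0.2793, 0.5756, 0.3466]
theta_star = [0.25, 0.64, 0.36]
info = [[203.2, -3.908, 56.10],
        [-3.908, 449.1, 177.7],
        [56.10, 177.7, 4169.0]]
inv_diag = [4.941e-3, 2.266e-3, 0.245e-3]  # diagonal of the inverse information

# Approximate 95% confidence intervals: estimate +/- 1.96 * standard error.
z = 1.96
intervals = [(t - z * math.sqrt(v), t + z * math.sqrt(v))
             for t, v in zip(theta_hat, inv_diag)]

# Wald-type statistic (theta_hat - theta*)^T I (theta_hat - theta*).
diff = [a - b for a, b in zip(theta_hat, theta_star)]
chi2 = sum(diff[i] * info[i][j] * diff[j] for i in range(3) for j in range(3))


def chi2_sf_3df(x):
    # Survival function of the chi-square distribution with 3 degrees of freedom.
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))


p_value = chi2_sf_3df(chi2)
print(intervals, chi2, p_value)
```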

APPENDIX 

A.1. Proofs of technical lemmas.

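Proof of Lemma 3. Assume $m' \ge m$. Note that

$$
\begin{aligned}
&\bar p_\theta(Y_k | \overline{\mathbf{Y}}_{-m}^{k-1}, X_{-m} = x) - \bar p_\theta(Y_k | \overline{\mathbf{Y}}_{-m'}^{k-1}, X_{-m'} = x') \\
&\quad = \iint \Bigl[ \int g_\theta(Y_k | \overline{\mathbf{Y}}_{k-1}, x_k) \, q_\theta(x_{k-1}, x_k) \, \mu(dx_k) \Bigr]
\mathbb{P}_\theta(dx_{k-1} | X_{-m} = x_{-m}, \overline{\mathbf{Y}}_{-m}^{k-1}) \, \delta_x(dx_{-m}) \\
&\qquad - \iint \Bigl[ \int g_\theta(Y_k | \overline{\mathbf{Y}}_{k-1}, x_k) \, q_\theta(x_{k-1}, x_k) \, \mu(dx_k) \Bigr]
\mathbb{P}_\theta(dx_{k-1} | X_{-m} = x_{-m}, \overline{\mathbf{Y}}_{-m}^{k-1}) \, \mathbb{P}_\theta(dx_{-m} | \overline{\mathbf{Y}}_{-m'}^{k-1}, X_{-m'} = x') .
\end{aligned}
$$

Hence, by Corollary 1,

$$
\bigl| \bar p_\theta(Y_k | \overline{\mathbf{Y}}_{-m}^{k-1}, X_{-m} = x) - \bar p_\theta(Y_k | \overline{\mathbf{Y}}_{-m'}^{k-1}, X_{-m'} = x') \bigr|
\le \rho^{k+m-1} \sigma_+ \int g_\theta(Y_k | \overline{\mathbf{Y}}_{k-1}, x) \, \mu(dx). \tag{32}
$$

Similarly we have

$$
\begin{aligned}
\bar p_\theta(Y_k | \overline{\mathbf{Y}}_{-m}^{k-1}, X_{-m} = x)
&= \iint g_\theta(Y_k | \overline{\mathbf{Y}}_{k-1}, x_k) \, q_\theta(x_{k-1}, x_k) \, \mu(dx_k) \, \mathbb{P}_\theta(dx_{k-1} | \overline{\mathbf{Y}}_{-m}^{k-1}, X_{-m} = x) \\
&\ge \sigma_- \int g_\theta(Y_k | \overline{\mathbf{Y}}_{k-1}, x) \, \mu(dx).
\end{aligned} \tag{33}
$$

The proof of (12) is concluded as in Lemma 2, and (13) follows by setting
$m' = m$ and integrating w.r.t.
$\mathbb{P}_\theta(dx_{-m} | \overline{\mathbf{Y}}_{-m}^{k-1})$ in (32) and (33).
To prove (14), notice that, by (33),

$$
\sigma_- b_-(Y_k, \overline{\mathbf{Y}}_{k-1}) \le \bar p_\theta(Y_k | \overline{\mathbf{Y}}_{-m}^{k-1}, X_{-m} = x) \le b_+ . \qquad \Box
$$

[Figure 1. Trajectories of the estimates of $\sigma_\eta^2$, $\sigma_s^2$ and
$\sigma_\varepsilon^2$ for five runs of the MCEM algorithm, starting from five
different initial points.]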


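Proof of Lemma 4. We will first prove that for any fixed $x \in \mathcal{X}$
and any $m$, $\Delta_{0,m,x}(\theta)$ is continuous w.r.t. $\theta$. We have

$$
\bar p_\theta(Y_0 | \overline{\mathbf{Y}}_{-m}^{-1}, X_{-m} = x)
= \frac{\bar p_\theta(Y_{-m+1}^{0} | \overline{\mathbf{Y}}_{-m}, X_{-m} = x)}{\bar p_\theta(Y_{-m+1}^{-1} | \overline{\mathbf{Y}}_{-m}, X_{-m} = x)},
$$

where, for $j \in \{-1, 0\}$,

$$
\bar p_\theta(Y_{-m+1}^{j} | \overline{\mathbf{Y}}_{-m}, X_{-m} = x)
= \int \prod_{i=-m+1}^{j} q_\theta(x_{i-1}, x_i) \times \prod_{i=-m+1}^{j} g_\theta(Y_i | \overline{\mathbf{Y}}_{i-1}, x_i) \, \mu^{\otimes(m+j)}(dx_{-m+1}^{j}), \tag{34}
$$

with $x_{-m} = x$. Thus
$\bar p_\theta(Y_{-m+1}^{j} | \overline{\mathbf{Y}}_{-m}, X_{-m} = x)$ is
continuous w.r.t. $\theta$ by continuity of $q_\theta$ and $g_\theta$ and the
bounded convergence theorem; the integrand is bounded by
$(\sigma_+ b_+)^{m+j}$. Since $\{\Delta_{0,m,x}(\theta)\}$ converges uniformly
w.r.t. $\theta \in \Theta$, $\mathbb{P}_{\theta^*}$-a.s.,
$\Delta_{0,\infty}(\theta)$ is continuous w.r.t. $\theta \in \Theta$,
$\mathbb{P}_{\theta^*}$-a.s., and the proof follows using Lemma 3 and the
dominated convergence theorem. □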


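Proof of Proposition 2. By Lemma 2 it is sufficient to prove that

$$
\limsup_{n\to\infty} \sup_{\theta\in\Theta} |n^{-1} l_n(\theta) - l(\theta)| = 0, \quad \mathbb{P}_{\theta^*}\text{-a.s.}
$$

Furthermore, since $\Theta$ is compact, we only need to prove that for all
$\theta \in \Theta$,

$$
\limsup_{\delta\to0} \limsup_{n\to\infty} \sup_{|\theta'-\theta|\le\delta} |n^{-1} l_n(\theta') - l(\theta)| = 0, \quad \mathbb{P}_{\theta^*}\text{-a.s.}
$$

Decompose the difference as

$$
\begin{aligned}
\limsup_{\delta\to0} \limsup_{n\to\infty} \sup_{|\theta'-\theta|\le\delta} |n^{-1} l_n(\theta') - l(\theta)|
&= \limsup_{\delta\to0} \limsup_{n\to\infty} \sup_{|\theta'-\theta|\le\delta} |n^{-1} l_n(\theta') - n^{-1} l_n(\theta)| \\
&\le A + B + C,
\end{aligned}
$$

where

$$
\begin{aligned}
A &= \limsup_{\delta\to0} \limsup_{n\to\infty} \sup_{|\theta'-\theta|\le\delta} n^{-1} \sum_{k=1}^{n} |\Delta_{k,0}(\theta') - \Delta_{k,\infty}(\theta')|, \\
B &= \limsup_{\delta\to0} \limsup_{n\to\infty} \sup_{|\theta'-\theta|\le\delta} n^{-1} \sum_{k=1}^{n} |\Delta_{k,\infty}(\theta') - \Delta_{k,\infty}(\theta)|, \\
C &= \limsup_{n\to\infty} n^{-1} \sum_{k=1}^{n} |\Delta_{k,\infty}(\theta) - \Delta_{k,0}(\theta)|.
\end{aligned}
$$

The terms $A$ and $C$ are zero by Corollary 2, and by the ergodic theorem
and Lemma 4,

$$
\begin{aligned}
B &\le \limsup_{\delta\to0} \limsup_{n\to\infty} n^{-1} \sum_{k=1}^{n} \sup_{|\theta'-\theta|\le\delta} |\Delta_{k,\infty}(\theta') - \Delta_{k,\infty}(\theta)| \\
&= \limsup_{\delta\to0} \mathbb{E}_{\theta^*}\Bigl[ \sup_{|\theta'-\theta|\le\delta} |\Delta_{0,\infty}(\theta') - \Delta_{0,\infty}(\theta)| \Bigr]
= 0, \quad \mathbb{P}_{\theta^*}\text{-a.s.} \qquad \Box
\end{aligned}
$$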


Proof of Lemma 5. We will show that for all l > 0
By stationarity, this implies the statement of the lemma.
First recall that z s = (x s ,y s ,..., y\) and note that 


pe{z s \y-i) = Y[qe{xj- 


3= 1 


i,Xj)ge(yj\xj,y j -i)Fg(dx 0 \y 0 _ i }iJ,^ s l \dx{ x ) 


< o + 



X\9e{yj\xj,yj-i) 

3 = 1 


x P g (dxo\y-i)^ s x \dx{ x ) 


= a + h g (z s ), 

say, where hg(z s ) implicitly depends on yo, but not on i, and integrates to 
unity (it is a density w.r.t. r<8> v). Furthermore, 

< JJ Pe{Y^ +l \ z k-l)\B k '~ S ~ 1 (z s ,dZk-l) — !Tg(dZk-l)\ 
x pg(z s \Y°_ i )(p <g> P)(dz s ) 

<b e + a + J ||n fc-s-1 (z a , ■) -Trg\\Tvhg(z s )(y,® v){dz s )\ 

the bound on pg(Y^ +l \zk-i) follows as in (34). Now (35) is a result of the 
above, (3) and dominated convergence. □ 


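Let, for $0 \le k \le m$,

$$
U_{k,m}(\theta) \triangleq \log \bar p_\theta(Y_1^p | \overline{\mathbf{Y}}_0, \overline{\mathbf{Y}}_{-m}^{-k}), \qquad
U(\theta) \triangleq \log \bar p_\theta(Y_1^p | \overline{\mathbf{Y}}_0).
$$

Proof of Lemma 7. It is enough to show that, for all $\theta \in \Theta$,

$$
\lim_{k\to\infty} \mathbb{E}_{\theta^*}\Bigl[ \sup_{m\ge k} |U_{k,m}(\theta) - U(\theta)| \Bigr] = 0. \tag{36}
$$

Put

$$
A_{k,m} = \bar p_\theta(Y_{-s+1}^{p} | \overline{\mathbf{Y}}_{-m}^{-k}), \quad
A = \bar p_\theta(Y_{-s+1}^{p}), \quad
B_{k,m} = \bar p_\theta(\overline{\mathbf{Y}}_0 | \overline{\mathbf{Y}}_{-m}^{-k}), \quad
B = \bar p_\theta(\overline{\mathbf{Y}}_0).
$$

Then

$$
\bigl| \bar p_\theta(Y_1^p | \overline{\mathbf{Y}}_0, \overline{\mathbf{Y}}_{-m}^{-k}) - \bar p_\theta(Y_1^p | \overline{\mathbf{Y}}_0) \bigr|
= \Bigl| \frac{A_{k,m}}{B_{k,m}} - \frac{A}{B} \Bigr|
\le \frac{B |A_{k,m} - A| + A |B_{k,m} - B|}{B B_{k,m}}. \tag{37}
$$

By conditioning on $(X_{-s}, \overline{\mathbf{Y}}_{-s})$ [cf. (34)] and
utilizing (A1)(b), it follows that

$$
B \ge \sigma_-^{s} \int \prod_{i=-s+1}^{0} \int g_\theta(Y_i | Y_{i-1}, \ldots, Y_{-s+1}, y_{-s}, \ldots, y_{i-s}, x) \, \mu(dx) \,
\mathbb{P}_\theta(\overline{\mathbf{Y}}_{-s} \in d\overline{\mathbf{y}}_{-s}) > 0, \quad \mathbb{P}_{\theta^*}\text{-a.s.} \tag{38}
$$

Hence, by Lemma 5, with $\mathbb{P}_{\theta^*}$-probability arbitrarily close
to 1, $B_{k,m}$ is uniformly bounded away from zero for $m \ge k$ and $k$
sufficiently large, and Lemma 5 and (37) show that

$$
\lim_{k\to\infty} \sup_{m\ge k} \bigl| \bar p_\theta(Y_1^p | \overline{\mathbf{Y}}_0, \overline{\mathbf{Y}}_{-m}^{-k}) - \bar p_\theta(Y_1^p | \overline{\mathbf{Y}}_0) \bigr| = 0 \quad \text{in } \mathbb{P}_{\theta^*}\text{-probability}.
$$

Using the inequality $|\log x - \log y| \le |x - y| / (x \wedge y)$ and (38)
once again, we find that

$$
\lim_{k\to\infty} \sup_{m\ge k} |U_{k,m}(\theta) - U(\theta)| = 0 \quad \text{in } \mathbb{P}_{\theta^*}\text{-probability},
$$

and (36) follows using dominated convergence provided

$$
\mathbb{E}_{\theta^*}\Bigl[ \sup_{k} \sup_{m\ge k} |U_{k,m}(\theta)| \Bigr] < \infty .
$$

This expectation is indeed finite since
$\bar p_\theta(Y_1^p | \overline{\mathbf{Y}}_0, \overline{\mathbf{Y}}_{-m}^{-k})$
is bounded from below by
$\sigma_-^{p} \prod_{i=1}^{p} b_-(\overline{\mathbf{Y}}_{i-1}, Y_i)$ and from
above by $(\sigma_+ b_+)^p$ [cf. (34)], and the logarithms of these bounds are
in $L^1(\mathbb{P}_{\theta^*})$. □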
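The two elementary inequalities used in this proof, the ratio bound $|a'/b' - a/b| \le (b|a' - a| + a|b' - b|)/(b b')$ for positive numbers and the logarithm bound $|\log x - \log y| \le |x - y|/(x \wedge y)$, can be spot-checked numerically. The sketch below is only a sanity check on random positive inputs; the sampling ranges and tolerance are arbitrary assumptions.

```python
import math
import random

rng = random.Random(0)
violations = 0
for _ in range(10000):
    # Logarithm inequality: |log x - log y| <= |x - y| / min(x, y).
    x = rng.uniform(1e-3, 10.0)
    y = rng.uniform(1e-3, 10.0)
    if abs(math.log(x) - math.log(y)) > abs(x - y) / min(x, y) + 1e-9:
        violations += 1

    # Ratio inequality for positive numbers a_km, b_km, a, b.
    a_km, b_km, a, b = (rng.uniform(0.1, 5.0) for _ in range(4))
    lhs = abs(a_km / b_km - a / b)
    rhs = (b * abs(a_km - a) + a * abs(b_km - b)) / (b * b_km)
    if lhs > rhs + 1e-9:
        violations += 1

print(violations)
```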


A.2. Proof of Proposition 4. We preface the proof with several lemmas. 
For convenience, Proposition 4 will be proved for q = 1. Adaptations to 
general q are obvious. 

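Define for $k \ge 1$, $m \ge 0$ and $x \in \mathcal{X}$,

$$
\Delta_{k,m,x}(\theta) = \mathbb{E}_\theta\Bigl[ \sum_{i=-m+1}^{k} \varphi(\theta, Z_i) \Bigm| \overline{\mathbf{Y}}_{-m}^{k}, X_{-m} = x \Bigr]
- \mathbb{E}_\theta\Bigl[ \sum_{i=-m+1}^{k-1} \varphi(\theta, Z_i) \Bigm| \overline{\mathbf{Y}}_{-m}^{k-1}, X_{-m} = x \Bigr].
$$

Along the same lines as in Lemma 9, for $m, n \ge 0$ and
$0 \le k \le n + m - 1$,

$$
\begin{aligned}
\mathbb{P}_\theta(X_{n-k} \in A | X_{n-k+1}^{n}, \overline{\mathbf{Y}}_{-m}^{n}, X_{-m} = x)
&= \mathbb{P}_\theta(X_{n-k} \in A | X_{n-k+1}, \overline{\mathbf{Y}}_{-m}^{n-k}, X_{-m} = x) \\
&\ge \frac{\sigma_-}{\sigma_+} \mu_k(\overline{\mathbf{Y}}_{-m}^{n-k}, X_{-m} = x, A),
\end{aligned}
$$

where $\mu_k(\overline{\mathbf{Y}}_{-m}^{n-k}, X_{-m} = x, \cdot)$ is a
probability measure. The result above in particular implies that

$$
\bigl\| \mathbb{P}_\theta(X_i \in \cdot \,|\, \overline{\mathbf{Y}}_{-m}^{n}, X_{-m} = x) - \mathbb{P}_\theta(X_i \in \cdot \,|\, \overline{\mathbf{Y}}_{-m}^{n-1}, X_{-m} = x) \bigr\|_{TV} \le \rho^{n-i-1}. \tag{39}
$$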


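Lemma 13. Under the assumptions of Proposition 4 there exists a random
variable $K \in L^1(\mathbb{P}_{\theta^*})$ such that, for all $k \ge 1$ and
$0 \le m \le m'$,

$$
\sup_{x\in\mathcal{X}} \sup_{\theta\in G} |\Delta_{k,m,x}(\theta) - \Delta_{k,m}(\theta)| \le K (k \vee m)^2 \rho^{(k+m)/2}, \quad \mathbb{P}_{\theta^*}\text{-a.s.}, \tag{40}
$$

$$
\sup_{x\in\mathcal{X}} \sup_{\theta\in G} |\Delta_{k,m,x}(\theta) - \Delta_{k,m',x}(\theta)| \le K (k \vee m)^2 \rho^{(k+m)/2}, \quad \mathbb{P}_{\theta^*}\text{-a.s.} \tag{41}
$$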

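Proof. The proof is along the same lines as the proof of Lemma 10,
using (39). Put
$\|\varphi_i\|_\infty = \sup_{x\in\mathcal{X}} \sup_{\theta\in G} |\varphi(\theta, x, \overline{\mathbf{Y}}_i)|$.
Combining the relations

$$
\begin{aligned}
\bigl| \mathbb{E}_\theta[\varphi(\theta, Z_i) | \overline{\mathbf{Y}}_{-m}^{k}, X_{-m} = x] - \mathbb{E}_\theta[\varphi(\theta, Z_i) | \overline{\mathbf{Y}}_{-m}^{k}] \bigr| &\le 2 \|\varphi_i\|_\infty \rho^{i+m}, \\
\bigl| \mathbb{E}_\theta[\varphi(\theta, Z_i) | \overline{\mathbf{Y}}_{-m}^{k}, X_{-m} = x] - \mathbb{E}_\theta[\varphi(\theta, Z_i) | \overline{\mathbf{Y}}_{-m}^{k-1}, X_{-m} = x] \bigr| &\le 2 \|\varphi_i\|_\infty \rho^{k-1-i}, \\
\bigl| \mathbb{E}_\theta[\varphi(\theta, Z_i) | \overline{\mathbf{Y}}_{-m}^{k}] - \mathbb{E}_\theta[\varphi(\theta, Z_i) | \overline{\mathbf{Y}}_{-m}^{k-1}] \bigr| &\le 2 \|\varphi_i\|_\infty \rho^{k-1-i},
\end{aligned}
$$

we obtain

$$
\begin{aligned}
|\Delta_{k,m,x}(\theta) - \Delta_{k,m}(\theta)|
&\le 4 \sum_{i=-m+1}^{k} \|\varphi_i\|_\infty (\rho^{i+m} \wedge \rho^{k-1-i}) \\
&\le 4 \max_{-m\le i\le k} \|\varphi_i\|_\infty \sum_{i=-m+1}^{k} (\rho^{i+m} \wedge \rho^{k-1-i}) \\
&\le 4 (k \vee m)^2 \sum_{i=-\infty}^{\infty} \frac{\|\varphi_i\|_\infty}{(|i| \vee 1)^2}
\Bigl( \sum_{i \le (k-m-1)/2} \rho^{k-1-i} + \sum_{i \ge (k-m-1)/2} \rho^{i+m} \Bigr) \\
&\le 8 (k \vee m)^2 \sum_{i=-\infty}^{\infty} \frac{\|\varphi_i\|_\infty}{(|i| \vee 1)^2} \, \frac{\rho^{(k+m-1)/2}}{1 - \rho},
\end{aligned}
$$

which proves the first part of the lemma.

For the second part we also use the bound

$$
\bigl| \mathbb{E}_\theta[\varphi(\theta, Z_i) | \overline{\mathbf{Y}}_{-m}^{k}, X_{-m} = x] - \mathbb{E}_\theta[\varphi(\theta, Z_i) | \overline{\mathbf{Y}}_{-m'}^{k}, X_{-m'} = x] \bigr| \le 2 \|\varphi_i\|_\infty \rho^{i + m \wedge m'}
$$

to obtain

$$
|\Delta_{k,m,x}(\theta) - \Delta_{k,m',x}(\theta)|
\le 4 \sum_{i=-m+1}^{k} \|\varphi_i\|_\infty (\rho^{i+m} \wedge \rho^{k-1-i}) + 2 \sum_{i=-m'+1}^{-m} \|\varphi_i\|_\infty \rho^{k-1-i}.
$$

Here the first term on the right-hand side is bounded as above. Since
$-i/2 \le (k - m)/2 - i$ for $i \le -m$, the second term can be bounded as

$$
\begin{aligned}
2 \sum_{i=-m'+1}^{-m} \|\varphi_i\|_\infty \rho^{k-1-i}
&\le 2 \rho^{(k+m)/2} \sum_{i=-m'+1}^{-m} \|\varphi_i\|_\infty \rho^{(k-m)/2 - 1 - i} \\
&\le 2 \rho^{(k+m)/2} \sum_{i=-m'+1}^{-m} \|\varphi_i\|_\infty \rho^{-i/2 - 1} \\
&\le 2 \rho^{(k+m)/2} \sum_{i=-\infty}^{\infty} \|\varphi_i\|_\infty \rho^{|i|/2 - 1},
\end{aligned}
$$

and the proof is complete. □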
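The geometric estimate that drives the first part of this proof is that a sum of terms $\rho^{i+m} \wedge \rho^{k-1-i}$ over $-m < i < k$ is at most $2\rho^{(k+m-1)/2}/(1-\rho)$, obtained by splitting the sum at the crossover index $(k-m-1)/2$. The following sketch checks that estimate numerically for a range of values; the function names and the grid of test values are illustrative assumptions.

```python
def min_sum(k, m, rho):
    # Sum over i = -m+1, ..., k-1 of min(rho^(i+m), rho^(k-1-i)).
    return sum(min(rho ** (i + m), rho ** (k - 1 - i))
               for i in range(-m + 1, k))


def geometric_bound(k, m, rho):
    # Claimed upper bound 2 * rho^((k+m-1)/2) / (1 - rho).
    return 2.0 * rho ** ((k + m - 1) / 2.0) / (1.0 - rho)


worst_ratio = 0.0
for rho in (0.3, 0.7, 0.95):
    for k in range(1, 16):
        for m in range(0, 16):
            ratio = min_sum(k, m, rho) / geometric_bound(k, m, rho)
            worst_ratio = max(worst_ratio, ratio)

print(worst_ratio)  # stays at or below 1 if the bound holds on this grid
```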


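By Lemma 13, for all $x \in \mathcal{X}$ and $k \ge 1$,
$\{\Delta_{k,m,x}(\theta)\}_{m\ge0}$ converges uniformly w.r.t. $\theta \in G$
$\mathbb{P}_{\theta^*}$-a.s. and in $L^1(\mathbb{P}_{\theta^*})$ to a random
variable that we denote by $\Delta_{k,\infty}(\theta)$; by (40) this limit does
not depend on $x$. Lemma 13 also immediately implies that

$$
n^{-1} \sum_{k=1}^{n} \sup_{\theta\in G} |\Delta_{k,0}(\theta) - \Delta_{k,\infty}(\theta)| \to 0, \quad \mathbb{P}_{\theta^*}\text{-a.s. and in } L^1(\mathbb{P}_{\theta^*}).
$$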


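Lemma 14. Under the assumptions of Proposition 4, for all
$x \in \mathcal{X}$ and $m \ge 0$ the function
$\theta \mapsto \Delta_{0,m,x}(\theta)$ is $\mathbb{P}_{\theta^*}$-a.s.
continuous on $G$. In addition, for all $\theta \in G$ and all
$x \in \mathcal{X}$,

$$
\lim_{\delta\to0} \mathbb{E}_{\theta^*}\Bigl[ \sup_{|\theta'-\theta|\le\delta} |\Delta_{0,m,x}(\theta') - \Delta_{0,m,x}(\theta)| \Bigr] = 0 .
$$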


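Proof. Note that
$|\Delta_{0,m,x}(\theta)| \le 2 \sum_{i=-m+1}^{0} \|\varphi_i\|_\infty$. Thus,
under the assumptions of Proposition 4, $\Delta_{0,m,x}(\theta)$ is uniformly
bounded w.r.t. $\theta$ by a random variable in $L^1(\mathbb{P}_{\theta^*})$.
It hence suffices to show that for $-m < i \le 0$,

$$
\lim_{\delta\to0} \sup_{|\theta'-\theta|\le\delta}
\bigl| \mathbb{E}_{\theta'}[\varphi(\theta', Z_i) | \overline{\mathbf{Y}}_{-m}^{0}, X_{-m} = x]
- \mathbb{E}_{\theta}[\varphi(\theta, Z_i) | \overline{\mathbf{Y}}_{-m}^{0}, X_{-m} = x] \bigr| = 0, \quad \mathbb{P}_{\theta^*}\text{-a.s.}
$$

Write

$$
\mathbb{E}_\theta[\varphi(\theta, Z_i) | \overline{\mathbf{Y}}_{-m}^{0}, X_{-m} = x]
= \int \varphi(\theta, x_i, \overline{\mathbf{Y}}_i) \, \bar p_\theta(X_i = x_i | \overline{\mathbf{Y}}_{-m}^{0}, X_{-m} = x) \, \mu(dx_i) \tag{42}
$$

and note that for all $x_i$, $\varphi(\theta, x_i, \overline{\mathbf{Y}}_i)$ is
continuous w.r.t. $\theta$ and that this factor is bounded by
$\|\varphi_i\|_\infty < \infty$. Moreover,

$$
\bar p_\theta(X_i = x_i | \overline{\mathbf{Y}}_{-m}^{0}, X_{-m} = x)
= \frac{\bar p_\theta(X_i = x_i, Y_{-m+1}^{0} | \overline{\mathbf{Y}}_{-m}, X_{-m} = x)}{\bar p_\theta(Y_{-m+1}^{0} | \overline{\mathbf{Y}}_{-m}, X_{-m} = x)} .
$$

Here $\bar p_\theta(Y_{-m+1}^{0} | \overline{\mathbf{Y}}_{-m}, X_{-m} = x)$ is
continuous w.r.t. $\theta$ (see the proof of Lemma 4), and using (34) we find
that this density is bounded from below by

$$
\sigma_-^{m} \prod_{i=-m+1}^{0} \int g_\theta(Y_i | \overline{\mathbf{Y}}_{i-1}, x_i) \, \mu(dx_i) > 0
$$

uniformly w.r.t. $\theta$. In a similar fashion
$\bar p_\theta(X_i = x_i, Y_{-m+1}^{0} | \overline{\mathbf{Y}}_{-m}, X_{-m} = x)$
is continuous in $\theta$ and bounded from above by $(\sigma_+ b_+)^m$. We
conclude that
$\bar p_\theta(X_i = x_i | \overline{\mathbf{Y}}_{-m}^{0}, X_{-m} = x)$ is
continuous in $\theta$ and bounded from above uniformly w.r.t. $\theta$. Hence
the integrand in (42) is continuous in $\theta$ and bounded from above
uniformly w.r.t. $\theta$. Dominated convergence shows that the left-hand side
of (42) is continuous in $\theta$ and the proof is complete. □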


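By Lemma 13 $\{\Delta_{0,m,x}(\theta)\}$ is a uniform Cauchy sequence w.r.t.
$\theta$ $\mathbb{P}_{\theta^*}$-a.s. and in $L^1(\mathbb{P}_{\theta^*})$, and
by Lemma 14 $\Delta_{0,m,x}(\theta)$ is continuous w.r.t. $\theta$ on $G$
$\mathbb{P}_{\theta^*}$-a.s. and in $L^1(\mathbb{P}_{\theta^*})$ for each $m$.
Hence it follows that $\Delta_{0,\infty}(\theta)$ is continuous w.r.t.
$\theta$ on $G$ $\mathbb{P}_{\theta^*}$-a.s. and in
$L^1(\mathbb{P}_{\theta^*})$, that is, for each $\theta \in G$,

$$
\lim_{\delta\to0} \sup_{|\theta'-\theta|\le\delta} |\Delta_{0,\infty}(\theta') - \Delta_{0,\infty}(\theta)| = 0, \quad \mathbb{P}_{\theta^*}\text{-a.s. and in } L^1(\mathbb{P}_{\theta^*}). \tag{43}
$$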


Remark 8. It is important to stress at this point that the result above
does not imply that $\Delta_{0,m}(\theta)$ is continuous w.r.t. $\theta$
because, contrary to JP, we do not assume any kind of regularity condition
for the stationary distribution as a function of $\theta$. Nevertheless, we
have proved above that $\Delta_{0,\infty}(\theta)$ is continuous.


We may now prove a locally uniform law of large numbers. 


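Lemma 15. Under the assumptions of Proposition 4, for all $\theta \in G$,

$$
\lim_{\delta\to0} \lim_{n\to\infty} \sup_{|\theta'-\theta|\le\delta}
\Bigl| n^{-1} \sum_{k=1}^{n} \Delta_{k,\infty}(\theta') - \mathbb{E}_{\theta^*}[\Delta_{0,\infty}(\theta)] \Bigr| = 0, \quad \mathbb{P}_{\theta^*}\text{-a.s.}
$$

Proof. Write

$$
\begin{aligned}
&\sup_{|\theta'-\theta|\le\delta} \Bigl| n^{-1} \sum_{k=1}^{n} \Delta_{k,\infty}(\theta') - \mathbb{E}_{\theta^*}[\Delta_{0,\infty}(\theta)] \Bigr| \\
&\quad \le \sup_{|\theta'-\theta|\le\delta} \Bigl| n^{-1} \sum_{k=1}^{n} \bigl( \Delta_{k,\infty}(\theta') - \Delta_{k,\infty}(\theta) \bigr) \Bigr|
+ \Bigl| n^{-1} \sum_{k=1}^{n} \Delta_{k,\infty}(\theta) - \mathbb{E}_{\theta^*}[\Delta_{0,\infty}(\theta)] \Bigr| \\
&\quad \le n^{-1} \sum_{k=1}^{n} \sup_{|\theta'-\theta|\le\delta} |\Delta_{k,\infty}(\theta') - \Delta_{k,\infty}(\theta)|
+ \Bigl| n^{-1} \sum_{k=1}^{n} \Delta_{k,\infty}(\theta) - \mathbb{E}_{\theta^*}[\Delta_{0,\infty}(\theta)] \Bigr| .
\end{aligned}
$$

As $n \to \infty$, the first term on the right-hand side tends to

$$
\mathbb{E}_{\theta^*}\Bigl[ \sup_{|\theta'-\theta|\le\delta} |\Delta_{0,\infty}(\theta') - \Delta_{0,\infty}(\theta)| \Bigr], \quad \mathbb{P}_{\theta^*}\text{-a.s.},
$$

an expression which, by (43), vanishes when $\delta \to 0$. The second term
vanishes $\mathbb{P}_{\theta^*}$-a.s. as $n \to \infty$ by the ergodic theorem.
This completes the proof. □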


We have now at hand all the necessary elements to prove Proposition 4. 


Proof of Proposition 4. Convergence of $\Delta_{k,m}(\theta)$ and continuity
of $\mathbb{E}_{\theta^*}[\Delta_{0,\infty}(\theta)]$ have been proved above, so
it remains to show the last part of the proposition.

Note that

$$
\mathbb{E}_\theta\Bigl[ \sum_{i=1}^{n} \varphi(\theta, Z_{i-q+1}^{i}) \Bigm| \overline{\mathbf{Y}}_0^{n}, X_0 = x_0 \Bigr] = \sum_{k=1}^{n} \Delta_{k,0,x_0}(\theta).
$$

Letting $m' \to \infty$ in Lemma 13 we find that
$|\Delta_{k,0,x_0}(\theta) - \Delta_{k,\infty}(\theta)| \le K k^2 \rho^{k/2}$
$\mathbb{P}_{\theta^*}$-a.s. and hence it is sufficient to prove that

$$
\lim_{\delta\to0} \lim_{n\to\infty} \sup_{|\theta'-\theta|\le\delta}
\Bigl| n^{-1} \sum_{k=1}^{n} \Delta_{k,\infty}(\theta') - \mathbb{E}_{\theta^*}[\Delta_{0,\infty}(\theta)] \Bigr| = 0, \quad \mathbb{P}_{\theta^*}\text{-a.s.}
$$

This, however, is Lemma 15. □

A.3. Proof of Proposition 5. The proof of Proposition 5 closely follows
the proof of Proposition 4. Only the main adaptations from the proof are
presented. We gather in the following lemma some of the required bounds for
the conditional covariance. In the proof of Proposition 5 we will consider for
convenience $q = 1$, and we let $\phi_{\theta,i} = \phi(\theta, Z_i)$ and
$\|\phi_i\|_\infty = \sup_{\theta\in G} \sup_{x\in\mathcal{X}} |\phi(\theta, x, \overline{\mathbf{Y}}_i)|$.

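Lemma 16. Under the assumptions of Proposition 5, for all $m' \ge m \ge 0$,
all $-m < i, j \le n$, all $\theta \in G$ and all $x \in \mathcal{X}$,

$$
\begin{aligned}
\bigl| \mathrm{cov}_\theta[\phi_{\theta,i}, \phi_{\theta,j} | \overline{\mathbf{Y}}_{-m}^{n}] \bigr| &\le 2 \rho^{|i-j|} \|\phi_i\|_\infty \|\phi_j\|_\infty, \\
\bigl| \mathrm{cov}_\theta[\phi_{\theta,i}, \phi_{\theta,j} | \overline{\mathbf{Y}}_{-m}^{n}, X_{-m} = x] \bigr| &\le 2 \rho^{|i-j|} \|\phi_i\|_\infty \|\phi_j\|_\infty, \\
\bigl| \mathrm{cov}_\theta[\phi_{\theta,i}, \phi_{\theta,j} | \overline{\mathbf{Y}}_{-m}^{n}, X_{-m} = x] - \mathrm{cov}_\theta[\phi_{\theta,i}, \phi_{\theta,j} | \overline{\mathbf{Y}}_{-m}^{n}] \bigr| &\le 6 \|\phi_i\|_\infty \|\phi_j\|_\infty \rho^{m + i \wedge j}, \\
\bigl| \mathrm{cov}_\theta[\phi_{\theta,i}, \phi_{\theta,j} | \overline{\mathbf{Y}}_{-m}^{n}] - \mathrm{cov}_\theta[\phi_{\theta,i}, \phi_{\theta,j} | \overline{\mathbf{Y}}_{-m}^{n-1}] \bigr| &\le 6 \|\phi_i\|_\infty \|\phi_j\|_\infty \rho^{n - 1 - i \vee j}, \\
\bigl| \mathrm{cov}_\theta[\phi_{\theta,i}, \phi_{\theta,j} | \overline{\mathbf{Y}}_{-m}^{n}, X_{-m} = x] - \mathrm{cov}_\theta[\phi_{\theta,i}, \phi_{\theta,j} | \overline{\mathbf{Y}}_{-m}^{n-1}, X_{-m} = x] \bigr| &\le 6 \|\phi_i\|_\infty \|\phi_j\|_\infty \rho^{n - 1 - i \vee j}.
\end{aligned}
$$

All these relations stem from Corollary 1, Lemma 9, (39) and observations
such as, for $i \le j$,

$$
\begin{aligned}
&\bigl| \mathbb{P}_\theta(X_i \in A, X_j \in B | \overline{\mathbf{Y}}_{-m}^{n}, X_{-m} = x)
- \mathbb{P}_\theta(X_i \in A | \overline{\mathbf{Y}}_{-m}^{n}, X_{-m} = x) \, \mathbb{P}_\theta(X_j \in B | \overline{\mathbf{Y}}_{-m}^{n}, X_{-m} = x) \bigr| \\
&\quad = \mathbb{P}_\theta(X_i \in A | \overline{\mathbf{Y}}_{-m}^{n}, X_{-m} = x) \\
&\qquad \times \bigl| \mathbb{P}_\theta(X_j \in B | \overline{\mathbf{Y}}_{-m}^{n}, X_i \in A, X_{-m} = x) - \mathbb{P}_\theta(X_j \in B | \overline{\mathbf{Y}}_{-m}^{n}, X_{-m} = x) \bigr| \\
&\quad \le \rho^{j-i}.
\end{aligned}
$$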

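Details of the proof are omitted for brevity.
For $x \in \mathcal{X}$ define

$$
\Gamma_{k,m,x}(\theta) = \mathrm{var}_\theta\Bigl[ \sum_{i=-m+1}^{k} \phi_{\theta,i} \Bigm| \overline{\mathbf{Y}}_{-m}^{k}, X_{-m} = x \Bigr]
- \mathrm{var}_\theta\Bigl[ \sum_{i=-m+1}^{k-1} \phi_{\theta,i} \Bigm| \overline{\mathbf{Y}}_{-m}^{k-1}, X_{-m} = x \Bigr].
$$

We again follow the pattern of proof consisting of showing that for each $k$
and $x \in \mathcal{X}$, the sequence $\{\Gamma_{k,m,x}(\theta)\}_{m\ge0}$ is a
uniform (w.r.t. $\theta \in G$) Cauchy sequence that converges to a limit
which does not depend on $x$.

Lemma 17. Under the assumptions of Proposition 5 there exists a random
variable $K \in L^1(\mathbb{P}_{\theta^*})$ such that, for all $k \ge 1$ and
$0 \le m \le m'$,

$$
\sup_{x\in\mathcal{X}} \sup_{\theta\in G} |\Gamma_{k,m,x}(\theta) - \Gamma_{k,m}(\theta)| \le K (m + k)^3 \rho^{(k+m)/4}, \quad \mathbb{P}_{\theta^*}\text{-a.s.}, \tag{44}
$$

$$
\sup_{x\in\mathcal{X}} \sup_{\theta\in G} |\Gamma_{k,m,x}(\theta) - \Gamma_{k,m',x}(\theta)| \le K (m + k)^3 \rho^{(k+m)/8}, \quad \mathbb{P}_{\theta^*}\text{-a.s.} \tag{45}
$$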


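Proof. Let, for $a \le b$, $S_a^b = \sum_{i=a}^{b} \phi_{\theta,i}$ (the
dependence on $\theta$ is implicit). The difference
$\Gamma_{k,m,x}(\theta) - \Gamma_{k,m}(\theta)$ may be decomposed as
$A + 2B + C$, where

$$
\begin{aligned}
A &= \mathrm{var}_\theta[S_{-m+1}^{k-1} | \overline{\mathbf{Y}}_{-m}^{k}, X_{-m} = x] - \mathrm{var}_\theta[S_{-m+1}^{k-1} | \overline{\mathbf{Y}}_{-m}^{k-1}, X_{-m} = x] \\
&\quad - \mathrm{var}_\theta[S_{-m+1}^{k-1} | \overline{\mathbf{Y}}_{-m}^{k}] + \mathrm{var}_\theta[S_{-m+1}^{k-1} | \overline{\mathbf{Y}}_{-m}^{k-1}], \\
B &= \mathrm{cov}_\theta[S_{-m+1}^{k-1}, \phi_{\theta,k} | \overline{\mathbf{Y}}_{-m}^{k}, X_{-m} = x] - \mathrm{cov}_\theta[S_{-m+1}^{k-1}, \phi_{\theta,k} | \overline{\mathbf{Y}}_{-m}^{k}], \\
C &= \mathrm{var}_\theta[\phi_{\theta,k} | \overline{\mathbf{Y}}_{-m}^{k}, X_{-m} = x] - \mathrm{var}_\theta[\phi_{\theta,k} | \overline{\mathbf{Y}}_{-m}^{k}].
\end{aligned}
$$

By applying Lemma 16, it follows that

$$
|A| \le 2 \sum_{-m+1 \le i \le j \le k-1} \bigl( 2 \times 6 \rho^{m+i} \wedge 4 \times 2 \rho^{j-i} \wedge 2 \times 6 \rho^{k-1-j} \bigr)
\times \max_{-m+1 \le i \le j \le k-1} \|\phi_i\|_\infty \|\phi_j\|_\infty .
$$

The Cauchy-Schwarz inequality yields

$$
\begin{aligned}
\max_{-m \le i \le j \le k} \|\phi_i\|_\infty \|\phi_j\|_\infty
&\le \sum_{i=-m}^{k} \|\phi_i\|_\infty^2 \\
&\le \sum_{i=-m}^{k} (|i| \vee 1)^2 \sum_{i=-m}^{k} \frac{1}{(|i| \vee 1)^2} \|\phi_i\|_\infty^2 \\
&\le (m^3 + k^3) \sum_{i=-\infty}^{\infty} \frac{1}{(|i| \vee 1)^2} \|\phi_i\|_\infty^2 ,
\end{aligned} \tag{46}
$$

where the last sum is in $L^1(\mathbb{P}_{\theta^*})$. Furthermore, for
$n \ge 0$,

$$
\begin{aligned}
\sum_{0 \le i \le j \le n} (\rho^{i} \wedge \rho^{j-i} \wedge \rho^{n-j})
&\le 2 \sum_{0 \le i \le n/2} \; \sum_{i \le j \le n-i} (\rho^{j-i} \wedge \rho^{n-j}) \\
&\le 2 \sum_{0 \le i \le n/2} \Bigl( \sum_{i \le j \le (n+i)/2} \rho^{n-j} + \sum_{(n+i)/2 \le j \le n-i} \rho^{j-i} \Bigr) \\
&\le \frac{4}{1-\rho} \sum_{0 \le i \le n/2} \rho^{(n-i)/2}
\le \frac{4 \rho^{n/4}}{(1-\rho)(1-\rho^{1/2})} .
\end{aligned}
$$

This shows that $|A|$ is bounded by an expression as in the first part of the
lemma.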

Similarly we have

$$
|B| \le 6 \sum_{i=-m+1}^{k-1} (\rho^{m+i} \wedge \rho^{k-i}) \max_{-m+1 \le i \le k-1} \|\phi_i\|_\infty \|\phi_k\|_\infty .
$$

For the maximum we can use the bound (46), and for the sum we note that,
for $n \ge 0$,

$$
\sum_{i=0}^{n} (\rho^{i} \wedge \rho^{n-i}) \le \sum_{0 \le i \le n/2} \rho^{n-i} + \sum_{n/2 \le i \le n} \rho^{i} \le \frac{2 \rho^{n/2}}{1 - \rho} .
$$

Thus $|B|$ is bounded by an expression as in the first part of the lemma.

For $C$ we have $|C| \le 6 \rho^{k+m} \|\phi_k\|_\infty^2$, and the proof of the
first part of the lemma is complete.

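The difference $\Gamma_{k,m,x}(\theta) - \Gamma_{k,m',x}(\theta)$ may be
decomposed as $A + 2B + C + D + 2E + 2F$, where

$$
\begin{aligned}
A &= \mathrm{var}_\theta[S_{-m+1}^{k-1} | \overline{\mathbf{Y}}_{-m}^{k}, X_{-m} = x] - \mathrm{var}_\theta[S_{-m+1}^{k-1} | \overline{\mathbf{Y}}_{-m}^{k-1}, X_{-m} = x] \\
&\quad - \mathrm{var}_\theta[S_{-m+1}^{k-1} | \overline{\mathbf{Y}}_{-m'}^{k}, X_{-m'} = x] + \mathrm{var}_\theta[S_{-m+1}^{k-1} | \overline{\mathbf{Y}}_{-m'}^{k-1}, X_{-m'} = x], \\
B &= \mathrm{cov}_\theta[S_{-m+1}^{k-1}, \phi_{\theta,k} | \overline{\mathbf{Y}}_{-m}^{k}, X_{-m} = x] - \mathrm{cov}_\theta[S_{-m+1}^{k-1}, \phi_{\theta,k} | \overline{\mathbf{Y}}_{-m'}^{k}, X_{-m'} = x], \\
C &= \mathrm{var}_\theta[\phi_{\theta,k} | \overline{\mathbf{Y}}_{-m}^{k}, X_{-m} = x] - \mathrm{var}_\theta[\phi_{\theta,k} | \overline{\mathbf{Y}}_{-m'}^{k}, X_{-m'} = x], \\
D &= \mathrm{var}_\theta[S_{-m'+1}^{-m} | \overline{\mathbf{Y}}_{-m'}^{k}, X_{-m'} = x] - \mathrm{var}_\theta[S_{-m'+1}^{-m} | \overline{\mathbf{Y}}_{-m'}^{k-1}, X_{-m'} = x], \\
E &= \mathrm{cov}_\theta[S_{-m+1}^{k-1}, S_{-m'+1}^{-m} | \overline{\mathbf{Y}}_{-m'}^{k}, X_{-m'} = x]
- \mathrm{cov}_\theta[S_{-m+1}^{k-1}, S_{-m'+1}^{-m} | \overline{\mathbf{Y}}_{-m'}^{k-1}, X_{-m'} = x], \\
F &= \mathrm{cov}_\theta[S_{-m'+1}^{-m}, \phi_{\theta,k} | \overline{\mathbf{Y}}_{-m'}^{k}, X_{-m'} = x].
\end{aligned}
$$

Here $|A|$, $|B|$ and $|C|$ can be bounded as above, using variants of the
bounds in Lemma 16.

Before proceeding, we note that for $k \ge 1$, $m \ge 0$ and $i \le 0$, the
following implications hold:

if $j \le (k + i - 1)/2$, then $(|j| - 1)/2 \le (3k + i - 3)/4 - j$,

if $(k + i - 1)/2 \le j \le k - 1$, then $(|j| - 1)/4 \le j + (-k - 3i + 1)/4$,

if $i \le -m$, then $|i|/8 \le (k - 2i - m)/8$,

if $i \le -m$, then $3|i|/4 \le (k - m)/4 - i$.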
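The four implications are elementary but easy to get wrong by an index, so an exhaustive check over a finite grid can be reassuring. The sketch below uses exact rational arithmetic; the grid limits are arbitrary assumptions, and the check covers integers with $k \ge 1$, $m \ge 0$ and $i \le 0$ as in the statement.

```python
from fractions import Fraction as F


def implications_hold(k, m, i, j):
    # Returns False as soon as one of the four implications fails.
    if 2 * j <= k + i - 1:
        if not F(abs(j) - 1, 2) <= F(3 * k + i - 3, 4) - j:
            return False
    if k + i - 1 <= 2 * j and j <= k - 1:
        if not F(abs(j) - 1, 4) <= j + F(-k - 3 * i + 1, 4):
            return False
    if i <= -m:
        if not F(abs(i), 8) <= F(k - 2 * i - m, 8):
            return False
        if not F(3 * abs(i), 4) <= F(k - m, 4) - i:
            return False
    return True


all_hold = all(implications_hold(k, m, i, j)
               for k in range(1, 9)
               for m in range(0, 9)
               for i in range(-20, 1)
               for j in range(-25, k))
print(all_hold)
```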

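Using these inequalities, we can bound $|D|$ as

$$
\begin{aligned}
|D| &\le 2 \sum_{-m'+1 \le i \le j \le -m} \bigl( 6 \rho^{k-1-j} \wedge 2 \times 2 \rho^{j-i} \bigr) \|\phi_i\|_\infty \|\phi_j\|_\infty \\
&\le 12 \rho^{(k+m-2)/8} \sum_{i=-m'+1}^{-m} \rho^{(k-2i-m)/8} \|\phi_i\|_\infty \\
&\qquad \times \Bigl( \sum_{i \le j \le (k+i-1)/2} \rho^{(3k+i-3)/4 - j} \|\phi_j\|_\infty
+ \sum_{(k+i-1)/2 \le j \le -m} \rho^{j + (-k-3i+1)/4} \|\phi_j\|_\infty \Bigr) \\
&\le 12 \rho^{(k+m-2)/8} \sum_{i=-\infty}^{\infty} \rho^{|i|/8} \|\phi_i\|_\infty \sum_{j=-\infty}^{\infty} \rho^{(|j|-1)/4} \|\phi_j\|_\infty .
\end{aligned}
$$

By the assumptions, the right-hand side has the required form.
Similarly,

$$
\begin{aligned}
|E| &\le \sum_{i=-m'+1}^{-m} \sum_{j=-m+1}^{k-1} \bigl( 6 \rho^{k-1-j} \wedge 2 \times 2 \rho^{j-i} \bigr) \|\phi_i\|_\infty \|\phi_j\|_\infty \\
&\le 6 \rho^{(k+m-2)/8} \sum_{i=-m'+1}^{-m} \rho^{(k-2i-m)/8} \|\phi_i\|_\infty \\
&\qquad \times \Bigl( \sum_{-m+1 \le j \le (k+i-1)/2} \rho^{(3k+i-3)/4 - j} \|\phi_j\|_\infty
+ \sum_{(k+i-1)/2 \le j \le k-1} \rho^{j + (-k-3i+1)/4} \|\phi_j\|_\infty \Bigr) \\
&\le 6 \rho^{(k+m-2)/8} \sum_{i=-\infty}^{\infty} \rho^{|i|/8} \|\phi_i\|_\infty \sum_{j=-\infty}^{\infty} \rho^{(|j|-1)/4} \|\phi_j\|_\infty
\end{aligned}
$$

and

$$
\begin{aligned}
|F| &\le \sum_{i=-m'+1}^{-m} 2 \rho^{k-i} \|\phi_i\|_\infty \|\phi_k\|_\infty \\
&= 2 \rho^{(k+m)/4} \sum_{i=-m'+1}^{-m} \rho^{(k-m)/4 - i} \|\phi_i\|_\infty \, \rho^{k/2} \|\phi_k\|_\infty \\
&\le 2 \rho^{(k+m)/4} \sum_{i=-\infty}^{\infty} \rho^{3|i|/4} \|\phi_i\|_\infty \sum_{j=-\infty}^{\infty} \rho^{|j|/2} \|\phi_j\|_\infty .
\end{aligned}
$$

The proof is complete. □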


Thus $\{\Gamma_{k,m,x}(\theta)\}_{m\ge0}$ is a uniform (w.r.t. $\theta \in G$)
Cauchy sequence $\mathbb{P}_{\theta^*}$-a.s. and in
$L^1(\mathbb{P}_{\theta^*})$, and $\{\Gamma_{k,m,x}(\theta)\}_{m\ge0}$ converges
as $m \to \infty$ uniformly w.r.t. $\theta$ $\mathbb{P}_{\theta^*}$-a.s. and in
$L^1(\mathbb{P}_{\theta^*})$ to a random variable
$\Gamma_{k,\infty}(\theta) \in L^1(\mathbb{P}_{\theta^*})$ which does not depend
on $x$ thanks to (44). By construction,

$$
\mathrm{var}_\theta\Bigl[ \sum_{k=1}^{n} \phi(\theta, Z_{k-q+1}^{k}) \Bigm| \overline{\mathbf{Y}}_0^{n}, X_0 = x_0 \Bigr] = \sum_{k=1}^{n} \Gamma_{k,0,x_0}(\theta),
$$

and the proof of Proposition 5 follows along the same lines as that of
Proposition 4.